What's the world record for computer uptime?
September 20, 2006 3:59 PM   Subscribe

What's the world record for computer uptime?

I'm going to guess that there's never been any sort of official record taken, so any suggestions or leads are appreciated. For the sake of our sanity, let's set the following criteria:

1) Must be (or have been) running an operating system of some kind (so no vacuum tube computers.)
2) Actually, to further rule out ultra-antique processors, let's say that the term "crash" must make sense when applied to the computer. (Pocket calculators don't really crash, nor do abaci. Abacuses? Abaci.)

I'm sure there are other criteria that will rule out more silly responses, but I can't think of them at the moment, so this will do for now.

Anyone willing to hazard a guess?
posted by tweebiscuit to Computers & Internet (27 answers total) 4 users marked this as a favorite
There are 100% uptime systems that NEVER go down - multiple power backups, multiple cores that can be hot-swapped, nodes that can be updated individually... There are computers that have never been turned off since they've been turned on.
posted by jedrek at 4:01 PM on September 20, 2006

Here's a good place to start
posted by hindmost at 4:11 PM on September 20, 2006

Are embedded computers eligible, such as the ones in microwave ovens, VCRs, cars, etc.?
posted by -harlequin- at 4:11 PM on September 20, 2006

I'd guess that because of Moore's law, anything old that's not specifically a museum item has probably been replaced "recently".

Old "big" computers (govt, big enterprise) would have interfaces so obsolete that somewhere in the last 15 years they became impractical to maintain. "Small" ones (say, the POS terminal in a video rental shop on some corner) have probably been through at least one blackout recently.

So, you're probably looking for:
(1) a museum item that is somehow kept running for uptime's sake. In this case, you could eventually find it on Google, or
(2) a mainframe in some place which needs 100% uptime, but where people don't mind waiting 2 minutes for each query on their green phosphor terminals (check your nearest DMV :P), or
(3) something with an uptime of less than 15 years, which could be anywhere with a UPS. Good luck.

PS: the 15-year estimate is arbitrary.
posted by qvantamon at 4:15 PM on September 20, 2006

Actually, you're probably looking for:
(1) Telephone company switching equipment, where you may find computing devices with uptimes of 20-30 years.

(Admittedly, it's going to be a rare device even then. Several generations of switching equipment have come and gone in the interim.)
posted by majick at 4:22 PM on September 20, 2006

On second reading, I think the question may be somewhat misguided - it sounds like the criterion for uptime failure is a crash (i.e. an internal error), rather than a power failure, yet computers can only crash if there are bugs in the software (or hardware) - unanticipated machine states. Plenty of computers do not have any bugs in the hardware or software, and so will never crash - their uptime ends with either a power outage that has nothing to do with the computer, or hardware failure due to the corrosion of time on the parts.

Actually, a bug-free system can crash - in rare events, cosmic radiation can flip a bit, which could cause a bug. There isn't much you can do about that, though - putting the computer many miles underground will help, as will using discrete (i.e. large) transistors instead of microchips, so that most particles don't pack enough energy to flip a bit.
posted by -harlequin- at 4:38 PM on September 20, 2006

I imagine the systems on Voyager or the Pioneer probe might be up there.
posted by Civil_Disobedient at 5:00 PM on September 20, 2006

-harlequin-: wouldn't several bits need to be flipped, in a rather unlikely fashion, to actually cause an error?
posted by phrontist at 5:12 PM on September 20, 2006

I was going to say Voyager too, but the MI kind of eliminates embedded systems. Besides, Voyager spent a lot of its time asleep.

One other big iron system that might have ridiculous uptime is air traffic control. They have a very long replacement cycle.
posted by Mitheral at 5:18 PM on September 20, 2006

Just a quick follow-up: Pioneer 10 was launched 3/10/72, and the last signal we received from it was 1/23/03. That's thirty-odd years.

Voyager 1 & 2, on the other hand, were launched in 1977 and are expected to remain operational until approximately 2020.
posted by Civil_Disobedient at 5:27 PM on September 20, 2006 [1 favorite]

phrontist: I imagine that if a piece of code is going to store or write a value to a certain location, and the address value (that the code looks at to find which location to write to) gets a single bit flipped, Bad Stuff could happen :)

But yes, it's not exactly a common problem. At least, not here down on the planet surface :)
posted by -harlequin- at 5:50 PM on September 20, 2006
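-harlequin-'s scenario is easy to sketch. The following is a hypothetical illustration with made-up addresses (a dict standing in for RAM), not a real memory layout:

```python
# Hypothetical sketch: one flipped bit in a stored address redirects a write.
memory = {}                        # toy stand-in for RAM
intended = 0x401000                # where the code means to write (made-up address)
corrupted = intended ^ (1 << 20)   # a particle strike flips bit 20 of the pointer

memory[corrupted] = 42             # the write lands 1 MiB away from its target
assert intended not in memory      # the expected location was never written
print(hex(corrupted))              # prints 0x501000 -- Bad Stuff, as promised
```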

"One other big iron system that might have ridiculous uptime is air traffic control. They have a very long replacement cycle."
posted by Mitheral at 8:18 PM EST on September 20

Actually, in the U.S. ATC systems go down a lot, and all regional centers still have tools for writing "strips" (basically, control tagged flight summaries), delivering them to control positions, and updating raw radar plots with little plastic "boat" markers and grease pens.
posted by paulsc at 6:02 PM on September 20, 2006 [1 favorite]

Response by poster: Cool information everyone -- thanks!

And yes, Harlequin, you're absolutely right -- I should have been taking power outages into account as well (not sure why I didn't). My main goal was to eliminate computers on the level of calculators. Perhaps a better criterion would be "must have been running at least one calculation at all times for the duration of the period in question." (Thus excluding a very simple computer which calculates nothing until it receives input -- this might exclude most embedded circuits as well.)
posted by tweebiscuit at 6:10 PM on September 20, 2006

Phone switches are for sure candidates. I remember when one of the 5ESSes at AT&T (where I worked at the time) crashed one day due to a missed patch... people acted like Martians had just invaded. It was a HUGE deal.

I bet the guys at Tandem would know.
posted by popechunk at 6:22 PM on September 20, 2006 [1 favorite]

Single bit flippery is what the ECC RAM in your server is there to deal with.

Surprisingly, most particle-hit bit-flipping is due to radioactive decay in IC packaging, and not cosmic rays; putting systems deep underground might not help as much as you might expect.
posted by flabdablet at 6:23 PM on September 20, 2006

Harlequin: do you have any references for hardware/software that is 100% bug free?
posted by rsanheim at 6:36 PM on September 20, 2006

I never believed what I thought were apocryphal stories of running Netware servers being bricked up in old closets during remodeling projects, until I came across (in 2003) an old Compaq 386/16 Netware 3.11 print server, that had been running unattended, with its UPS, behind a big poster from 1996, in a supply closet for the marketing department of a sister company. The batteries in the UPS had finally died, and the server had tanked on a power interruption/spike, taking down service to some old large format HP plotters and printers that occasionally still did trade show layouts and other low volume drawings.

Took me the better part of a day to find it, too, by the usual method: finding one too many patch cords plugged into a panel feeding it in the central server room, and following the old Cat 3 wire. Lord knows who decided to put it in the marketing department closet, when, or why, but it had happily chugged away for, I suppose, years, with nary a peep from before I worked there. Might have been down a week or two when it became the object of investigation, because someone wanted a plot, and couldn't get it.

Nothing special about it, when I looked at it, either, except it was pretty filthy, and had an old 40 meg SCSI drive. I blew it out a little, put it on a different UPS, rebooted it, and put the poster back in front of it, while the users got back to running their plot.
posted by paulsc at 6:47 PM on September 20, 2006 [4 favorites]

Formal verification.
posted by Chuckles at 6:58 PM on September 20, 2006

Chuckles: I don't see how that allows for "proving 100% bug free software":
It is impossible to prove or test that a system has "no defect" since it is impossible to formally specify what "no defect" means. All that can be done is prove that a system does not have any of the defects that can be thought of, and has all of the properties that together make it functional and useful.
There is always some potential edge case or crazy combination that can get you.

Related: The nature of scientific "proof" and the development of software
posted by rsanheim at 8:05 PM on September 20, 2006

The way the wiki quote is phrased leaves the debate open to semantic arguments. They are using "system" to imply a physical implementation, but I think the word applies equally to abstractions.

Clearly models are not reality. So, you have issues like random errors, deterministic model inaccuracies, and inadequate theory.. However, you can prove the mathematical correctness of algorithms (for specific instances, but not in general - halting problem, and all that).

Anyway, more to your point, I think..

There is nothing fundamental about engineering software - as opposed to bridges - that makes creating robust systems harder. Economics and historical reasons caused software to become what it is. But that article you link.. On a quick skim, it reads like "Our software is bug ridden, but it isn't our fault."

It took civil engineers a long time to stop killing people needlessly, but they have done a pretty good job in recent years. I suggest that this is because civil engineers claimed responsibility for the mistakes of their trade, instead of blaming "the nature of scientific proof".
posted by Chuckles at 8:53 PM on September 20, 2006

I suspect that space probes have watchdog timers or other automatically-reboot-if-something-breaks mechanisms; just because it's still operational after spending thirty years in space doesn't mean it didn't crash in that time.

I second popechunk: one of the high-availability (Tandem, et al) phone switches is probably the longest-uptime computer by a reasonable definition.
posted by hattifattener at 11:44 PM on September 20, 2006
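The watchdog mechanism hattifattener describes works roughly like this toy sketch (hypothetical timeout values; a real one is a hardware timer that resets the processor when it expires):

```python
import time

class Watchdog:
    """Toy watchdog timer: the main loop must 'kick' it periodically, or a
    supervisor assumes the software has hung and forces a restart."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.kick()

    def kick(self):
        self.last_kick = time.monotonic()

    def expired(self):
        return time.monotonic() - self.last_kick > self.timeout

wd = Watchdog(timeout=0.05)   # 50 ms budget per loop iteration
rebooted = False
for step in range(6):
    if step == 3:
        time.sleep(0.1)       # simulated hang: this iteration overruns the budget
    if wd.expired():
        rebooted = True       # a real system would reset the processor here
        break
    wd.kick()
assert rebooted
```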

There is nothing fundamental about engineering software - as opposed to bridges - that makes creating robust systems harder.

This claim is often stated in solemn tones as if it were self-evident. I think it's far from self-evident. Unlike bridges, software complexity is unconstrained by physics.

Louis Savain disagrees, but his proposed solution strikes me as throwing out a fairly substantial amount of baby in pursuit of better bathwater.
posted by flabdablet at 2:36 AM on September 21, 2006

Five years and change, from the uptime project linked by hindmost, is the most I have seen documented. I think the problem for most servers that are actually in use is that you need to restart the thing to update your OS. When faced with installing critical security patches vs. taking down the server for a few minutes, admins choose to update. The longest-running (supposed) Windows box is around 4 years; it runs Win2000 unpatched and is just a piece of crap computer sitting in a datacenter doing nothing. The #1 Computer on uptime.net.

I'm not familiar with mainframe server architecture, but I am guessing there is a way to update OS software without reboot?
posted by sophist at 3:22 AM on September 21, 2006

Strange that someone would get upset and suspicious on hearing that a Windows server has been up a long time.

But anyway, at a place I worked I set up a machine running Windows NT Server 4.0 to do some MySQL and general testing. It served a few other bits and bobs to the outside world - a Quake server for a while, and the like. When we closed the office I looked at the uptime out of interest and it was around 222 days. Somewhat pedestrian compared to what’s been relayed here, but it surprised me all the same.
posted by ed\26h at 5:04 AM on September 21, 2006

If actual software bugs are the criterion, and not elective downtime for updates/upgrades, the Windows 2000 install that I have for general home use has a current uptime of about a year. The hardware's years old, and after the break-in period (I built it myself), I've only had 3 or 4 actual crashes in the life of the machine, and the last time it went down was from Katrina. The "if it ain't broke, don't fix it" wisdom is mostly what keeps me from installing XP.
posted by Mr. Gunn at 6:10 AM on September 21, 2006

Telecom switches (5ESS, Tandems, etc.) do crash and do go down, just not very often. In general they approach 99.999% availability (about 5 minutes of downtime per year). If you figure there are ~10,000 of these switches worldwide, it would seem plausible that there might be a handful of these systems that have been running continuously without a crash for several years. A typical "crash" on these systems means 30 minutes to a couple of hours of downtime, implying that for every crash, the system would need to be operational for 6-24 years.
posted by forforf at 6:37 AM on September 21, 2006
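forforf's figures check out as back-of-the-envelope arithmetic, using only the numbers from the comment above:

```python
# Five-nines availability expressed as a yearly downtime budget.
minutes_per_year = 365 * 24 * 60              # 525,600 minutes in a year
budget = minutes_per_year * (1 - 0.99999)     # ~5.26 minutes of downtime/year

# If one crash costs 30-120 minutes, how rare must crashes be to stay in budget?
for outage in (30, 120):
    print(round(outage / budget, 1), "years between crashes")
# prints roughly 5.7 and 22.8 -- in line with the 6-24 year figure above
```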

I have lots of thoughts on the software reliability end, but I think it is time for me to shut up about that..

I suspect that space probes have watchdog timers or other automatically-reboot-if-something-breaks mechanisms;

This is all about where the arbitrary boundaries of your system are. As long as the thing keeps doing what I want it to do, isn't that uptime? So what if it gets itself caught somewhere, if it is robust enough to detect the error and come back online and do its job..

When faced with installing critical security patches vs. taking down the server for a few minutes, admins choose to update.

I was originally thinking about how requirements cause downtime (both elective and accidental). If your box is doing a very particular job with clear and fixed requirements, you can have extraordinarily long uptime.. Which just leads back to the software reliability debate again :P
posted by Chuckles at 9:11 AM on September 21, 2006

This thread is closed to new comments.