Can long uptime cause server corruption?
November 3, 2009 8:42 AM

Is there any documentation out there showing that sometimes (memory) corruption can occur when servers are left running for 6+ months?

I'm in charge of a production system that recently experienced a production error, one that was eventually corrected with a system reboot. The error was in a .NET application on IIS, and the current theory is that there was some sort of corruption on the server, maybe memory corruption, caused by the system having been running for more than 6 months without a reboot. Is that possible? I've tried Google searches without success...
posted by joecacti to computers & internet (13 comments total)
I've never heard of this. Problems crop up on servers; sometimes they require a reboot, but 6 months is really not extraordinary for server uptime.
posted by Tomorrowful at 8:44 AM on November 3


Google just did a study on this. Here is a link to the research paper and here is an article summarizing the findings.
posted by bfranklin at 8:45 AM on November 3


This research paper from the University of Toronto, where they studied this phenomenon on thousands of Google servers, is probably a good start. If your servers are using ECC memory, most errors will be correctable and have no effect. The paper also talks about uncorrectable errors, and these are quite possible.
posted by FishBike at 8:47 AM on November 3


It almost certainly would not be a function of uptime. It may be a function of heat, but not of uptime alone. You would not be spared an error this afternoon just because the machine was off for a few seconds this morning.
posted by cmiller at 8:50 AM on November 3


It's possible, but it's far more likely that this is a programming error. 6 months is nothing.

I'm guessing this theory was put forth by the developers and not the sysadmin, right?
posted by shownomercy at 9:11 AM on November 3


It's possible, but I think a programming error is much more likely. Several types of programming errors, such as slow resource leaks or counter overflows, only show up after a long period of runtime, and are hard to detect for just that reason.
posted by grouse at 9:19 AM on November 3


Yes. It's possible. Assume that memory corruption happens at a constant rate (not strictly true, since error rates seem to go up with age), so that the chance that a newly booted server suffers a memory corruption in the next 5 minutes is exactly the same as the chance that a server that has been up for 6 months suffers one in the next 5 minutes. However, the server that has been up for 6 months may already have suffered multiple memory corruptions in that time, and so has a higher cumulative chance of crashing because of a bad bit in an important data structure somewhere (or worse, a bad bit that has been written to disk someplace important).
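
The cumulative argument above can be put into numbers with a small sketch. It assumes independent 5-minute intervals with a constant error probability; the one-in-a-million figure is purely hypothetical and is not a measured error rate for any real hardware.

```python
def cumulative_error_probability(p_per_interval: float, intervals: int) -> float:
    """Chance of at least one error across `intervals` independent intervals."""
    return 1.0 - (1.0 - p_per_interval) ** intervals


# Hypothetical figure, for illustration only: a one-in-a-million chance
# of a memory corruption in any given 5-minute window.
P_PER_5_MIN = 1e-6

# Number of 5-minute windows in roughly six months (182 days).
INTERVALS_IN_6_MONTHS = 182 * 24 * 12

fresh = cumulative_error_probability(P_PER_5_MIN, 1)
six_months = cumulative_error_probability(P_PER_5_MIN, INTERVALS_IN_6_MONTHS)

print(f"chance in the next 5 minutes: {fresh:.6%}")
print(f"chance over ~6 months of uptime: {six_months:.4%}")
```

The per-interval risk is the same for both servers; only the amount of exposure already accumulated since the last boot differs.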

Really though, this shouldn't be a hypothetical. If you are rigorous enough to really want to get to the bottom of production errors, then you should be running servers with ECC memory, with the ECC protection turned on, and make sure that your system is set up to log any ECC errors. ECC can correct single-bit errors in a 64-bit "word," and it can also detect most multi-bit errors. There are some subtleties here, but the chance that bad bits go both uncorrected and undetected is really, really low.
posted by Good Brain at 9:21 AM on November 3


Possible, but a programming error is the far more likely culprit.
posted by paulg at 9:48 AM on November 3


A server running .NET on IIS sounds like a Windows machine, so, on a tangent: is it actually true that Windows servers need periodic reboots as maintenance? Not trying to start a flamewar, but it's something I hear and I've never been able to get a straight answer.
posted by d. z. wang at 11:01 AM on November 3


is it actually true that Windows servers need periodic reboots as maintenance?

Just to install Windows Updates (which you'll want to install to ensure the box is secure).

Of course, that leads one to ask: why aren't the original poster patching their IIS box?
posted by JaredSeth at 12:35 PM on November 3


why isn't the original poster patching their IIS box?
posted by JaredSeth at 12:36 PM on November 3


Most servers I run into have been up a lot longer than six months without a reboot. Heck, my own desktop Macs have been up for longer than that, and they're running a much wider variety of software.

So while it's possible, it sounds more like handwaving.
posted by rokusan at 2:56 PM on November 3


It's only tangentially related, but I can't be the only one who remembers the 49.7-day uptime limit bug in Windows 95/98. So while it's most likely wrong to think that uptime is itself a problem these days, it wouldn't necessarily be unreasonable.
posted by hades at 4:28 PM on November 3
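
The 49.7-day figure mentioned above is commonly attributed to a 32-bit counter of milliseconds since boot wrapping around to zero. A minimal sketch of that arithmetic follows; the helper names `days_until_wrap` and `tick_count` are hypothetical and are not Windows APIs.

```python
COUNTER_BITS = 32                      # width of the millisecond counter
MS_PER_DAY = 24 * 60 * 60 * 1000       # milliseconds in one day


def days_until_wrap(bits: int = COUNTER_BITS) -> float:
    """Days of uptime before a millisecond counter of this width overflows."""
    return (2 ** bits) / MS_PER_DAY


def tick_count(uptime_ms: int, bits: int = COUNTER_BITS) -> int:
    """What a wrapping counter of this width would report for a given uptime."""
    return uptime_ms % (2 ** bits)


print(f"{days_until_wrap():.1f} days")   # prints "49.7 days"

# Just past the wrap point the counter reads a tiny value again, so code
# that compares raw tick values can conclude that time ran backwards.
print(tick_count(2 ** 32 + 5))           # prints 5
```

This is an example of a failure that depends purely on uptime and that a reboot "fixes" without any hardware memory corruption being involved.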

