A pesky server that won't obey me!
October 21, 2005 8:12 PM

I have a pesky server at the office that will restart every 12 hours or so. Even after my best efforts, I cannot figure out why this thing keeps restarting. I have a couple ideas and need to know if they will work.

I'd like to find out if this is a hardware or software issue. We have a second server with the exact same specs. I would like to swap the hard drives and see if the second server restarts randomly as well. The problem is that both machines use RAID arrays. Can I just take the whole RAID array out of one machine and put it in the other? If not, can I take an image of a RAID array? What program would you recommend? Both servers are running Windows Small Business Server 2003. I hope this all makes sense, thanks!
posted by meta87 to technology (17 comments total)
Windows writes a crash dump (memory dump) to a file when shit hits the fan. That should provide some clues.
posted by ori at 8:19 PM on October 21, 2005


SBS is an odd (ok, ugly) beast. One of the things it will do is shut itself down, believe it or not, if it does not like the way you have configured it. Joining it to another domain controller can cause this. I went through this myself last summer.

See this link, for some details.
posted by SNACKeR at 8:29 PM on October 21, 2005


The reason for the shutdown should be spelled out in the event log. Have you looked there?
posted by Rhomboid at 8:49 PM on October 21, 2005


Yes, I have looked in the event log. There is no useful information. Thanks for the help so far. Does anyone have any idea if switching the raids would work?

Thanks!
posted by meta87 at 10:14 PM on October 21, 2005


I don't know SBS off the top of my head, but in regular W2K3: Control Panel -> System -> Advanced tab -> Startup and Recovery -> clear the checkbox for "Automatically restart". If it's bluescreening, you'll at least get a chance to see why.

If you suspect it's got a corrupt DLL or some such software issue, run "sfc.exe /scannow" to invoke the file protection check.

Hardware side, random reboots suggest to me a transient failure rather than a 'magic smoke release' failure. I would guess either bad memory or a dying power supply. Any hardware changes lately? Have you blown the dust out of the PS? Will the BIOS report system voltages? Depending on the age of the hardware, you may want to replace the battery on the mobo. When the RTC battery goes, odd things can happen. And it's not overheating and going into thermal-protection shutdown, right?

Also, depending on the disk hardware, try installing a SMART monitor to get low-level diagnostics from the disk spindles. Restarts aren't generally disk-related, but I recommend SMART monitoring to everyone anyway.

As far as moving RAID disks around between controllers goes, it's very manufacturer-dependent, but so long as the controllers are both identical down to the firmware revision it should be OK. But my spidey-sense doesn't tingle on disks if the thing is rebooting spontaneously... that's not what disk failure looks like IME.
posted by Triode at 10:34 PM on October 21, 2005


How's the mains AC power? If your local wall voltage is unstable, that could be a factor. Adding a UPS or line conditioner can sometimes clear up otherwise untraceable bad mojo.
posted by Triode at 10:41 PM on October 21, 2005


Thanks for the response. No, I'm sure it isn't temperature, and I doubt it is the clock battery because the machine is only a couple of months old.

The reason I wanted to swap the RAIDs was so I could run the PDC on hardware that I know is functioning. Then if that machine also starts to reboot, I would know it is a software problem.

Thanks!
posted by meta87 at 10:43 PM on October 21, 2005


Software troubles would light up the event logs, I think. Again w/ the SBS not-my-thing caveat, but W2K3 is quite stable IME. Are there any services that are set to reboot the computer after N failures? That's not common config, but it could be created. It would leave miles of error logs, though.

If there really is nothing in eventvwr, that sounds to me like the rug is being yanked out from below Windows, and too quickly to write a log event. Thus my interest in HW. Does the hardware vendor have a bootable HW diagnostic disk? HP, Dell, etc. have OS-independent diags you could try. May I ask what brand hardware this is? Is it all on the msft HCL? (SBS is only sold on turnkey hardware, IIRC, so it almost assuredly is HCL'd)

Another random thought - most RAID cards are OEMed from folks like LSI Logic, Adaptec, etc. You might track down the documentation from the chipmaker for the RAID transportability question.

You're going to tell me that both stable and unstable boxes are plugged into the same outlet, and shoot down the UPS idea, aren'tcha? 'twas worth a shot.
posted by Triode at 11:03 PM on October 21, 2005


Yes, actually they are both plugged into UPSes, so I don't think that's it. I am kind of thinking that it could be a power supply problem. I'd have already switched it out with a new one, but it is this big, strangely shaped power supply and I will probably have to order one from the manufacturer.

It is an HP machine BTW. I will see if there is a diagnostic disc.
I agree that it is odd that there isn't much in the event logs. I'll let you know what I find out.

Thanks!
posted by meta87 at 11:12 PM on October 21, 2005


HP - oh good. You can swap disks around on HP RAID controllers until the cows come home. It should also have an 'Integrated Management Log' which may prove enlightening.

And maybe you know this, maybe it's new: On HP servers, anything with a burgundy-red handle is hot swappable. Light blue is cold-swap. I ran the stuff for years before someone told me!

Good luck!
posted by Triode at 11:27 PM on October 21, 2005


Does it happen at the exact same time every day? If so, it might be a periodic process on the machine, or a periodic external network process that kicks off the killer process.

A cheesy way to find out: write a little script that outputs the time every second to a file. When the server reboots there will be a gap in the log.
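
A minimal sketch of that idea in Python, assuming Python is available on the server. The file name heartbeat.log, the timestamp format, and the function names are all made up for illustration. The first function appends a timestamp every second and forces it to disk, so the last line survives a sudden reset; the second reads the log back and reports any gap longer than a few seconds.

```python
import os
import time
from datetime import datetime, timedelta

LOG_PATH = "heartbeat.log"      # hypothetical file name
FMT = "%Y-%m-%d %H:%M:%S"       # one timestamp per line


def run_heartbeat(path=LOG_PATH, interval=1.0):
    """Append the current time to `path` every `interval` seconds, forever."""
    while True:
        with open(path, "a") as f:
            f.write(datetime.now().strftime(FMT) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before a possible crash
        time.sleep(interval)


def find_gaps(lines, max_gap_seconds=5):
    """Return (last_seen, first_seen_after) pairs where the log went quiet."""
    stamps = [datetime.strptime(s.strip(), FMT) for s in lines if s.strip()]
    gaps = []
    for before, after in zip(stamps, stamps[1:]):
        if after - before > timedelta(seconds=max_gap_seconds):
            gaps.append((before, after))
    return gaps
```

Start run_heartbeat() at boot (for example from a scheduled task), then after a few reboots pass the lines of the log file to find_gaps(). The first timestamp of each pair is roughly when the machine went down, and the second is when it came back up.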
posted by maschnitz at 10:58 AM on October 22, 2005


No, it doesn't happen at the same time each day, but it does happen after roughly 12 hours of operation. It is sporadic though.

thanks
posted by meta87 at 11:17 AM on October 22, 2005


Please update this question if / when you discover the answer. I'm curious. Although it is obvious, have you done a full virus scan on the machine?

Also, if you're planning on swapping the HDs, backup, etc., I'd suggest starting in safe mode the first time you boot, in light of the different hardware configuration (same server model, but different MAC addresses and possibly other unique machine identifiers that Microsoft might employ). I don't have hard evidence, but IME, something as simple as putting the HDs in a different box turns out to be not so simple.

Also, if you suspect the PS, why not swap the PS from the good box into the suspect one? This'll be hella easier and should be devoid of config change problems.
posted by AllesKlar at 4:55 PM on October 22, 2005


Have you blown the dust out of the PS?

Dust is a subtle enemy. I once had a dual P1 (yes, that old) server, that after a few years started doing random reboots. Because I had not changed any software config I knew it had to be hardware. I tested *everything* to no avail.

It continued to bug me for weeks. I eliminated every possibility except for the motherboard. I had no spares. So I sat down and began to examine it in depth. After a few hours or so, I found the problem: a stray sharp metallic particle (possibly solder, possibly environmental dust) had managed to lodge between two traces on the motherboard, presumably shorting two or more subsystems.

It took a close-range, prolonged burst of compressed air through a point nozzle to dislodge it.

Moral of the story? It's true that when you hear hoof-beats in the distance, it's a good general rule to think of horses, not zebras. But sometimes it's a good idea to keep an eye out for stripes.
posted by meehawl at 6:15 PM on October 22, 2005


Sit around and watch it reboot.

Why?

An old engineer's anecdote: a server was oddly rebooting every 12/24 hours. No one knew why.
One night, two engineers decided to stay late to investigate and keep an eye on it.

Late, after everyone had usually gone home, the cleaning lady arrived. Just about the time the server often rebooted.
She proceeded to say hi to the engineers, and wander over and unplug the server to plug in her vacuum cleaner.

True story.
posted by nafrance at 5:01 AM on October 23, 2005


Actually, to back up nafrance's post...

Is it in a window-less server room/closet? I've had hardware that had built-in IR ports randomly reset when hit by sunlight at a certain angle.

And I've worked with other engineers who have similar horror stories about prototype equipment resetting at a similar time each day, and finding out it was something light-sensitive on the circuit.
posted by jkaczor at 1:14 PM on October 23, 2005


Haha nafrance, that sucks! I wish it had ended up being that simple, but I did figure it out in the end. Thanks to everyone for the help. After figuring out how to use the debugger, I was able to locate a driver that was crashing Windows. It belonged to a program I had installed called packet analyzer 5 enterprise. I'm pissed that I caused all this trouble; I was hoping I could blame someone else!

Anyway, thanks again for all the help! :)
posted by meta87 at 4:36 PM on October 28, 2005

