Software RAID Dies: Which disk is dead?
July 28, 2008 3:06 PM

A disk in your Windows Server 2003 software RAID fails: how do you know which physical disk to pull?

I'm adding another RAID5 array to my server, and I'm thinking ahead about disaster management.

The whole point of having a RAID is that if a disk fails, I pop in a new one, the RAID gets rebuilt, and it's no big deal. But which disk to remove?

If they're all identical, and they're all on a PCI SATA card, then you can't just pop into the motherboard BIOS and see which port has a dead drive, because the motherboard BIOS isn't aware of drives attached to the add-in card.

And the BIOS probably wouldn't be aware of a partial failure anyway. If the drive still registers on the port but no longer handles data, the BIOS won't know.

Is this something that should be planned for as the drives are added to the system? Maybe add each drive one at a time and label each with its ID as reported by the Disk Management snap-in? (There doesn't actually seem to be an ID for each drive there, other than the label, i.e., Disk 0, Disk 1, CD-ROM 0.)
posted by SlyBevel to Computers & Internet (10 answers total)
 
Can you plug them in one at a time, not mount them, and run a SMART tool on every one of them?
posted by stereo at 3:32 PM on July 28, 2008


Add-in RAID controllers have a BIOS of their own that you can use. On most controllers you get the option during system boot to press a key combination (e.g., Ctrl-A) to enter the RAID configuration. In that software there's usually a list of physical disks with serial numbers; if one fails, match the serial number to the barcode or label on the physical disk to know which one died.
posted by disclaimer at 4:01 PM on July 28, 2008


...and hardware RAID isn't usually seen by the OS, so the Disk Management snap-in isn't "aware" of the RAID array's physical disk configuration.
posted by disclaimer at 4:02 PM on July 28, 2008


For Linux, "smartctl -i /dev/..." or "hdparm -i /dev/..." will give you the serial number (hdparm only provides data for ATA drives). Do either of those exist for Windows?
posted by not sure this is a good idea at 4:13 PM on July 28, 2008
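
[smartmontools does have a Windows port, so the same "smartctl -i" trick works there too. As a minimal sketch of scripting the serial-number lookup — the drive model and serial below are invented sample values, not real output from any particular disk:

```python
import re

# Abridged, hypothetical `smartctl -i` output; real output has more fields.
SMARTCTL_OUTPUT = """\
Device Model:     WDC WD5000AAKS-00YGA0
Serial Number:    WD-WCAS81234567
Firmware Version: 12.01C02
"""

def parse_serial(text):
    """Pull the serial number out of smartctl -i style output."""
    match = re.search(r"^Serial Number:\s*(\S+)", text, re.MULTILINE)
    return match.group(1) if match else None

print(parse_serial(SMARTCTL_OUTPUT))  # WD-WCAS81234567
```

Run this against each drive's output in turn and you have a machine-readable list of serials to match against the labels on the physical disks. --Ed.]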


and hardware RAID isn't usually seen by the OS

The primary hardware RAID cards that I come across are either LSI MegaRAID (which is what's embedded in most Dell servers, among others), or 3ware add-in cards (which have a long history of Linux and BSD compatibility). I don't know if these are as popular among Windows servers, but considering Dell is in the picture, I'm willing to bet that yes, they are.

Both of those hardware RAIDs do expose management interfaces to the underlying OS. They require specialized utilities to access, but those utilities are generally available for Windows, Linux, and sometimes the BSDs. You want to find these for your particular RAID card, and install them, because until you can get the OS actively talking to the RAID card, the RAID won't be able to notify you when a disk quits.

You don't want your RAID to fail too silently.
posted by toxic at 5:23 PM on July 28, 2008


For software RAID, you have to label the drives. Or use the MMC Disk Management snap-in to find the serial number of the failed drive and replace the right one. Or look at the port number of the failed drive and hope the OS port numbers and physical port numbers match up.

You are correct, the BIOS probably won't be aware of any failures.
posted by gjc at 6:13 PM on July 28, 2008
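
[The label-as-you-install idea from the question can be as simple as a small inventory map kept alongside the server; a minimal sketch, where the disk numbers, port names, and serials are all invented placeholders for whatever your own system reports:

```python
# Record each drive's OS disk number, controller port, and serial
# the day you install it, so a report like "Disk 2 failed" maps
# straight to a physical port and label.
inventory = {
    0: {"port": "SATA-0", "serial": "WD-WCAS80000001"},
    1: {"port": "SATA-1", "serial": "WD-WCAS80000002"},
    2: {"port": "SATA-2", "serial": "WD-WCAS80000003"},
}

def locate(disk_number):
    """Translate an OS disk number into a physical pull instruction."""
    d = inventory[disk_number]
    return f"pull the drive on {d['port']} (serial {d['serial']})"

print(locate(2))  # pull the drive on SATA-2 (serial WD-WCAS80000003)
```

The point is only that the mapping has to be captured while every drive is still healthy; once one dies, the OS may no longer report enough to reconstruct it. --Ed.]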


With one drive connected at a time, boot with Ultimate Boot CD and run a hard drive diagnostic program (make 100% sure it is only going to read :P).

Not optimal, but really pretty harmless in the end.
posted by Chuckles at 7:23 PM on July 28, 2008


I know I'm not answering your question, but I would recommend against using software RAID on a production server. You have the identification issue and you're going to have horrendous performance if you end up with a drive failure. With software RAID, you've got to monitor the Windows Event Log for very specific disk failure events or you might not even know that you've had a failure until it's too late. If you unplug the wrong disk and try to boot Windows, there is a possibility that Windows may see the whole array as broken and drop your ability to recover.

Hardware RAID with hot-swappable drives will notify you there's been a failure, give you a nice flashy light to tell you which drive has gone, let you hot swap the new drive in, and help performance along during the failure/rebuild process. Plus, you can configure a hot spare so a failure is much less likely to lead to a disaster.

You're right to identify the drives now. Windows will give each drive a Location in the properties for that disk. When you have a problem, it's the Location that will be referenced. Identify the Locations when you start, and label them appropriately.
posted by cnc at 10:27 PM on July 28, 2008
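
[The Event Log monitoring cnc describes can be scripted; a minimal sketch, assuming the System log has been exported as tab-separated text. The sample records are illustrative, though error events from sources like "disk" and "ftdisk" (e.g., the "bad block" event) are typical of what to watch for:

```python
# Hypothetical tab-separated export of the Windows System event log:
# date, time, level, source, event ID, message.
EXPORTED_LOG = """\
7/28/2008\t3:06 PM\tError\tdisk\t7\tThe device, \\Device\\Harddisk1, has a bad block.
7/28/2008\t3:07 PM\tInformation\tService Control Manager\t7036\tThe Telnet service entered the running state.
"""

def disk_errors(log_text):
    """Return (event_id, message) for error events from disk-related sources."""
    hits = []
    for line in log_text.splitlines():
        date, time, level, source, event_id, message = line.split("\t", 5)
        if level == "Error" and source in ("disk", "ftdisk"):
            hits.append((event_id, message))
    return hits

for event_id, message in disk_errors(EXPORTED_LOG):
    print(event_id, message)
```

A scheduled task running a filter like this (or a proper monitoring agent) is what turns a silent software-RAID failure into an alert you actually see. --Ed.]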


Yeah, toxic, I didn't go far enough there. Dell server RAID cards have an audible alarm that will annoy you to death if a RAID array is degraded for any reason. And you use the server administrator software to manage RAID within Windows.
posted by disclaimer at 3:45 AM on July 29, 2008


IBM servers have light-path diagnostics, which is a fancy name for a little panel of LEDs on the front of the server and a series of LEDs on every monitored component in the system. When one of our drives failed the front panel indicator lit up and so did the individual drive. We pulled that out and popped in a new one.

I've yet to install the OS-level management tools, which would allow me to look into the array without shutting down the system.
posted by odinsdream at 6:06 AM on July 29, 2008

