I do this for a living, really...you'd think I could fix it.
November 25, 2007 1:03 AM

I know it's early in the morning, but hopefully someone who can help is out there. My RAID-5 array just went titsup...while attempting to add a disk. Please tell me I didn't lose a terabyte of mostly irreplaceable data...

MeFi tech support, help! This is a little long, sorry.

I had a three-disk RAID-5 array on a Windows Server 2003 box (three 500GB WD drives), giving me just under 1TB of space. Last night I tried to add a fourth disk for more space. I used the MediaShield utility application (which is how I built the array in the first place, and how I successfully replaced a previously failed drive) to add the disk. Everything started well.
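
For reference, the capacity figures in this post follow from RAID-5 giving up one disk's worth of space to parity. A minimal sketch of the arithmetic; the function names are illustrative, and the "just under 1TB" figure is assumed to come from the operating system reporting binary units:

```python
def raid5_usable_gb(disks: int, size_gb: int) -> int:
    """Usable space of a RAID-5 set: one disk's worth goes to parity."""
    return (disks - 1) * size_gb

def gb_to_tib(gb: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to binary tebibytes (2**40 bytes)."""
    return gb * 10**9 / 2**40

three_disk = raid5_usable_gb(3, 500)   # the original array
four_disk = raid5_usable_gb(4, 500)    # the array being built
print(three_disk, four_disk, round(gb_to_tib(three_disk), 2))
```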

Right at 5%, it choked (it may be relevant (or not) that right about this time something tried to access data on the array). The application (and O/S) froze (yes I'm sure it wasn't just grinding away in the background). Had to reboot.

After rebooting, both the BIOS RAID utility and the windows-based application recognised what I was trying to do and attempted to continue building the new 1.5TB array. Right at 5% again, the app threw up a "raid access failure"...on one of the original three disks.

click. click. click.

One of the original three drives was what was clicking. Went out and picked up a shiny new drive to replace it.

I decided to hell with it and wanted to just go back to where I was with three disks, but the RAID manager is having none of it; it keeps trying to rebuild the new four-disk array.

If it helps, the current configuration is this:

drive 1: good original 1TB-array disk
drive 2: good original 1TB-array disk
drive 3: bad clicky original 1TB-array disk
drive 4: good new 1.5TB-array disk (only built to 5%)

The short version of the question is this: can I get two new drives and successfully rebuild the array that it's trying to build (1.5TB), or is everything lost?

I apologise for the rambling but this data represents eight years of media collecting, a significant portion of which is irreplaceable. Upset is a mild term.

(Yes, I know the mantra "RAID does not replace backups" but I haven't been able to find a cost-effective (or affordable) 1TB+ backup solution)
posted by geckoinpdx to Computers & Internet (8 answers total)
A lot depends on how and where the array metadata (the description of the array and its state) is stored, and whether there is any duplication of that. Some high quality RAID controllers store a RAID configuration in local memory, and a copy of that on the disk array itself. A configuration change is done by first changing the controller's copy of the RAID configuration to include any hardware changes, then building/converting/re-building the new array to actual disks, and finally, writing out to the disk array copy the new finished, "as built" configuration, so that, once again, the memory copy of the metadata and its disk array backup are in sync. So, with those kinds of controllers, if there is a difference between the memory copy and the disk copy of the array metadata at boot, indicating an array corruption, you get a chance to choose which configuration to use. So, if you had one of those, it would be a simple matter to pull the 4th disk, plug in a good replacement for your clicky disk, rebuild the original 1 TB array using the disk copy metadata, and then add the 4th disk to expand the array.

But, failing that level of hardware/software sophistication, well, as Chinese fortune cookies sometimes say, "Future not so bright." If you're using a typical Intel motherboard chipset RAID controller for SATA disks (Intel Matrix Storage Technology), RAID 5 parity calculations are off-loaded to the CPU, as opposed to being done on the chipset silicon, and their BIOS feature software is pretty limited, but perhaps you could explore boot time options for recovering your array, along the lines I've described. If you're using other chipsets, compatible with AMD processors, you'll be bound by chipset features.

It doesn't sound like you're doing RAID 5 strictly in software, so there's not any point in discussing options for recovering software managed RAID. Good luck recovering your array, and implementing a backup strategy in future.
posted by paulsc at 1:36 AM on November 25, 2007

Response by poster: If it helps, it's hardware-based (theoretically) and not solely software-driven. I have the drives mounted in a backplane but the board itself (asus A8N-VM, I think, AMD-64 chip) does RAID 0, 1 and 5 natively.

I used the BIOS utility to recognise and set the drives up, and the software to actually do the work.
posted by geckoinpdx at 2:12 AM on November 25, 2007

Response by poster: fixed link.
posted by geckoinpdx at 2:14 AM on November 25, 2007

All I can wish you is the best of luck re. recovering your existing RAID array; I hope you make it work. If I were personally in your situation, I would be buying four more 500GB drives, disconnecting the originals from my RAID controller, and using GNU ddrescue to make block-for-block copies of every single one onto a new drive before doing anything else (stop that, paulsc - I can hear you wincing).
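
To make "block-for-block copy" concrete, here is a toy sketch of the idea: read the source in fixed-size blocks, keep going past anything unreadable, and record what was skipped. This is not ddrescue's actual algorithm (ddrescue is far more careful about retries and ordering); `rescue_copy` and its parameters are made up for illustration.

```python
import io

def rescue_copy(src, dst, size: int, block: int = 4096) -> list[tuple[int, int]]:
    """Copy `size` bytes from src to dst block by block.

    Unreadable or short blocks are left as zeros in dst and reported as
    (offset, length) pairs so they can be retried later.
    """
    bad = []
    for offset in range(0, size, block):
        length = min(block, size - offset)
        try:
            src.seek(offset)
            chunk = src.read(length)
        except OSError:
            chunk = b""
        if len(chunk) != length:
            bad.append((offset, length))
            chunk = bytes(length)  # placeholder zeros keep offsets aligned
        dst.seek(offset)
        dst.write(chunk)
    return bad

# Demo with in-memory "disks"; a real run would open block devices instead.
source = io.BytesIO(b"abcdefgh")
target = io.BytesIO()
bad_ranges = rescue_copy(source, target, size=8, block=4)
```

The point of keeping the copy aligned with the original offsets is that the RAID layout depends on where each block sits on each disk, so a copy with holes is still usable where a shifted copy would not be.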

As for cost-effective backup solutions for vast media libraries: if you buy drives about two sizes behind the bleeding edge, you'll find that they're a fair bit cheaper per gigabyte. Not sure what prices are like where you are, but where I am, 3 * 320GB drives comes out about 30% cheaper than 2 * 500GB.

Which means that the right way to do backups is just to buy twice as much disk storage as you actually need, using the cheapest-per-gigabyte size of drive available at the time, and periodically use something like rsync to keep two completely separate copies (preferably on drives fitted to different computers) in sync. Building a second computer just for backups sounds like overkill, but you can do it using whatever low-performance parts you have lying around since it doesn't need high performance just to run an rsync job overnight, and run it under Linux so you don't need to worry about software licensing.
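
A minimal sketch of such an overnight job, wrapped in Python so it can be scheduled and checked. The paths and the host name `backupbox` are hypothetical; the rsync options used (`--archive`, `--delete`, `--dry-run`) are standard ones, but treat the overall shape as one possible setup, not a recommendation for your exact machines.

```python
import subprocess

def rsync_command(source: str, dest: str, dry_run: bool = True) -> list[str]:
    """Build an rsync invocation that mirrors source into dest."""
    cmd = ["rsync", "--archive", "--delete"]
    if dry_run:
        cmd.append("--dry-run")  # report what would change, copy nothing
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd += [source.rstrip("/") + "/", dest]
    return cmd

def mirror(source: str, dest: str, dry_run: bool = True) -> None:
    """Run the mirror job and raise if rsync exits with an error."""
    subprocess.run(rsync_command(source, dest, dry_run), check=True)

# Hypothetical paths: a local media directory and a second machine.
print(rsync_command("/srv/media", "backupbox:/srv/media-copy"))
```

Note that `--delete` also propagates deletions to the copy, so an accidental delete on the main machine would be mirrored on the next run; leaving the dry run on until the output looks right is the cautious default here.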

You need to think of the cost of your backup machine basically as an insurance premium. Figure out how much the time you spent collecting all that stuff is worth to you, and see whether the cost of setting up a backup for it is reasonable compared to what you'd be prepared to pay to insure anything else worth about that much.

I really don't like relying on RAID controllers to do my mirroring for me. But if I was going to go RAID, I'd use Linux to do it in software, not a hardware controller whose compatible replacement I might be unable to obtain when it goes belly-up.
posted by flabdablet at 2:45 AM on November 25, 2007

Well, the Asus A8N-VM motherboard uses an nVidia nForce4 chipset, and there were known driver/hardware compatibility issues (particularly with early driver versions) in RAID configurations. But, information for that motherboard only indicates that it is hardware capable of RAID 0, 1 & 1+0. So, if you're doing RAID 5, you must be using some kind of software mojo to do it on the CPU.

I can't provide specific suggestions without knowing details of that software, but I'm more skeptical of your probability of success now, than before your follow up, geckoinpdx. If your RAID array was in the process of being re-written for expansion, without some kind of transaction log based roll back, full recovery is unlikely, as the array's early block parity information would have been split across 4 disks, while later parts of the volume are still only on 3. It's not an impossible thing to partially recover, but your recovery is unlikely to be 100%, particularly for blocks in the process of being "expanded" when your error occurred. Still, 99% or 98% might be a real blessing as far as you're concerned, and given that your data is largely media, the loss of a single movie or set of .mp3 files might be acceptable to you, compared to trashing the whole collection. This article from Tom's Hardware describes some moderately priced paid professional options you might consider.
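
To illustrate why a half-finished expansion is awkward, here is a rough sketch of where a logical block lands on a 3-disk versus a 4-disk RAID-5 set. The rotating-parity layout below is an assumption chosen for simplicity; the layout MediaShield actually writes is not known here, and both function names are made up.

```python
def locate(block: int, n_disks: int) -> tuple[int, int]:
    """Map a logical data block to (stripe, disk) with rotating parity."""
    data_per_stripe = n_disks - 1
    stripe, index = divmod(block, data_per_stripe)
    parity_disk = (n_disks - 1 - stripe) % n_disks
    # Data blocks occupy the non-parity disks in ascending order.
    data_disks = [d for d in range(n_disks) if d != parity_disk]
    return stripe, data_disks[index]

def locate_mid_reshape(block: int, reshaped_blocks: int) -> tuple[int, int, int]:
    """Return (n_disks, stripe, disk) for an array whose first
    `reshaped_blocks` logical blocks were already rewritten across 4 disks."""
    n_disks = 4 if block < reshaped_blocks else 3
    return (n_disks, *locate(block, n_disks))

# The same logical block sits in a different place under each layout.
print(locate(5, 3), locate(5, 4))
```

Under these assumptions, anything reading the array has to know exactly how many blocks were already rewritten (`reshaped_blocks`); without a log recording that boundary, a recovery tool has to guess which layout applies to each region.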
posted by paulsc at 2:51 AM on November 25, 2007

I second the recommendation to buy new drives and ddrescue (copy) the old ones over. To start.

Then I would try to assemble the 1TB array with the 2 good discs (screw the clicking disc). Once that can be brought online, you should add a 3rd to regain redundancy, and let it rebuild. If all goes well (big IF) you will be back where you started.

At this point, learn Linux RAID. Use the new discs you bought to ddrescue and build a new array. Copy the old one over. Count your blessings.

Why Linux? 2 reasons. First you can't always get a motherboard with the same RAID chip you need. But you can always get a new Linux CD.

And second, the recovery I described would have been possible in Linux RAID. If it is not possible with what you have, consider it a very very very expensive lesson.
posted by CautionToTheWind at 4:38 AM on November 25, 2007

There's one thing in your description that isn't clear, gecko. There's a difference between adding a drive to an array (where it can just sit there doing nothing, such as a hot spare) and migrating a 3-disk RAID-5 array to a 4-disk RAID-5 array. My understanding is that there's a separate migrate function in Mediashield. The question is: was the RAID array just initializing your disk for a future migration, or was it actually migrating data from the other three disks to the fourth disk?

If it's the former, and data wasn't written to the disk, I'll bet you can recover.

However, I suspect that it's the second situation, and if a migration was interrupted with both a drive failure and reboot... wow. Yeah, I doubt you're going to get that data back without bringing a data restoration firm into the mix.

Agreed with everyone in this thread who is telling you to dd the drives to "work drives" and keep the originals in their current state. That basically gives you unlimited chances to recover with a few trial and error strategies, so long as you don't mind the mind-numbing amount of time it will take to do the dd operations. You'd want to dd the two originals and the new drive; don't bother with the clicky drive.

Using duplicated drives, I'd try this configuration:

Good original drives, without the clicky drive and the new drive. Yes, a two disk array; RAID-5 will continue to function if you just pull one of the disks. If your original 1TB RAID is still valid, but the clicky disk was blocking it, it should fire up. It's a longshot, but worth trying.
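
The reason a three-disk RAID-5 set can run on two disks is that each stripe's parity block is the XOR of its data blocks, so any one missing block can be recomputed from the other two. A minimal sketch with made-up block contents:

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe of a three-disk RAID-5 set: two data blocks plus parity.
data_a = b"media-01"
data_b = b"media-02"
parity = xor_blocks(data_a, data_b)

# Pull the disk holding data_b: it can be recomputed from the survivors.
rebuilt_b = xor_blocks(data_a, parity)
print(rebuilt_b == data_b)
```

This only holds while the two remaining disks agree on the same layout, which is exactly what is in doubt after an interrupted migration.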

If not, I'd try this configuration:

Good original drives and the new drive, without the clicky drive. Just leave the clicky drive disconnected from the channel. There's a small chance that there was still redundancy in the migration process, although I don't know enough about Mediashield to say.

If the data is absolutely, truly irreplaceable, then you can send the drives to OnTrack, who will do recovery of data from failed drives (and RAIDs). OnTrack is expensive but reputable, and will give you an estimate for ~$100 if I recall correctly.
posted by I EAT TAPAS at 8:38 AM on November 25, 2007

I really don't like relying on RAID controllers to do my mirroring for me. But if I was going to go RAID, I'd use Linux to do it in software, not a hardware controller whose compatible replacement I might be unable to obtain when it goes belly-up.

OT, but just to clear this one up -- I used to believe this too, but then I actually had some servers with hardware vendor RAID controllers and some servers with software RAID, and holy shit is the difference in throughput huge. (Both were 3Gb/s SATA on an AMD64 motherboard; one was a Tyan mobo and the other was an HP DL385g4 with the HP/Compaq RAID card.) The load average due to iowait would rapidly climb into the stratosphere on the Tyan doing software RAID, and the HP/Compaq would be able to serve everything without falling over ('cept that one day where we pushed a terabyte down the pipe.)
posted by SpecialK at 11:55 AM on November 25, 2007
