Help? My RAID Crashed!
April 19, 2007 8:25 PM   Subscribe

What happened to my RAID setup today?

Ok, here was my setup:

PowerMac G4 (Gigabit Ethernet) Dual 500 MHz, 512 MB RAM
40 GB IDE HD for System HD
(4) Western Digital 250 GB SATA2 Drives
Highpoint Technologies RocketRaid 1810A PCI SATA RAID Card

I've been working on/building computers for a LONG time, so I'm pretty handy inside the case. I set this up about a month ago to use as my home media server. The 4 SATA2 drives were set up as RAID 5 using the RocketRaid controller. After parity and formatting overhead, I had about 700 GB free to use.
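(For the curious, the 700 GB figure works out about like this; a quick sketch, and the GB-to-GiB conversion is my own accounting, not anything the RocketRaid reports:)

```python
# Rough RAID 5 capacity math for my setup. RAID 5 spends one drive's
# worth of space on parity, and a "250 GB" drive shrinks once the OS
# counts in binary GiB. Numbers are mine, not from the controller.

DRIVES = 4
DRIVE_GB = 250  # decimal gigabytes, as sold

usable_gb = (DRIVES - 1) * DRIVE_GB       # 750 GB after parity
usable_gib = usable_gb * 10**9 / 2**30    # what the OS actually shows

print(f"{usable_gb} GB after parity, ~{usable_gib:.0f} GiB as reported")
# -> 750 GB after parity, ~698 GiB as reported
```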

Absolutely no problems, running more or less 24/7 for about a month. Today I just thought I'd open up the RAID web interface from my iMac to poke around a little bit. When I logged in, it immediately told me my RAID status was "critical." When I went into array maintenance, it said that drive 2 was offline. I checked the log and didn't see any note of when it had happened.

I disconnected the system and brought it out to have a look. I opened it up and made sure all the connections were good. I didn't smell anything unusual. I took it back into the bedroom where it normally stays and hooked it back up. Now it wouldn't turn back on. It would start and stop almost instantly, as if something was shorting out or grounded.

I brought it back out to have a look, and when I opened it up, I smelled that smell that means something bad has happened. After some trial and error, I discovered the system would not boot with drive 4 connected. I pulled it out of the case, and it definitely had the fried-electronics smell. I hooked everything else back up, and it booted fine. But when I checked the RAID status, it was only recognizing 1 drive. That means 3 of the 4 drives failed.

I verified that all of the power connections and SATA connections on the controller card were working by testing them on the remaining working drive. Everything checked out: 3 failed drives, 1 working drive, all connections fine. Only 1 of the drives had that burnt-electronics smell; the other two that failed had no smell at all.

My question is: what could have possibly caused all of this? I know that all 4 drives were working fine less than a week ago, so at most I've been running degraded for 4-5 days. Is it really possible that all three drives failed within minutes of each other? The real kicker is that I went this route to get the most capacity while still having parity for protection. I've lost a lot of data; I probably had 80% or more backed up to disc, so it's not a total loss. I've already RMA'd the drives with Western Digital, but I'm scared to rebuild the array without knowing what caused this. I don't know if I can trust it now, and I don't want to go through this again in another month or so.

Also, if you'd recommend against rebuilding the array, I'm open to suggestions for 700 GB+ storage solutions capable of streaming HD content over the network that will work well serving primarily to Macs.
posted by drgonzo2k2 to Computers & Internet (4 answers total)
 
I suppose it's theoretically possible that 3 of the 4 drives were bad, but I agree that seems really unlikely.

Just off the cuff, I'd be really suspicious of both the RAID card and your computer's power supply. Either of those could have fried the drives.

Also, static-electricity issues, like a bad ground somewhere, might have contributed ... although I have to admit I've never heard of a failure quite like that.

Testing a power supply isn't the easiest thing in the world to do, because they change characteristics under load. It might be that with just one drive, your supply is fine, but with 5 drives plus the RAID card, plus the additional cooling burden, something went bad and it started putting out nastiness that fried the drives.

That's really about the only scenario that I can think of. If the machine takes a standard ATX supply (I can't remember off the top of my head whether a dual G4 GigE does or not ... it might require an adapter/wiring harness), I'd probably replace the PSU just to be careful ... if it's going, you don't want to wait. Then I'd put in 4 new drives and just let it idle, or run some benchmarks on it continuously, for a few weeks. (And check the outlet's ground with a circuit tester, and make sure that the chassis and drives are all grounded, etc.)

FWIW, I've heard a bunch of people suggest that it's bad mojo to use 4 identical drives in a RAID array like you're doing. It's generally recommended to spread the risk out by using different manufacturers, or at least different models/batches. This is just to prevent accidentally buying the modern-day equivalent of the old DeathStars and having them all go at once.
posted by Kadin2048 at 8:39 PM on April 19, 2007


As you've learned, RAID isn't a "fire and forget" strategy unless it's implemented with some kind of alarm notification. You do RAID to give yourself a way to come down gracefully from a single disk failure without data loss, perhaps while still meeting performance targets, and you know, going in, that the reliability of the array in toto will be lower than that of the components it's made from.
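If you do rebuild, even a dumb scheduled check that mails you when the array isn't healthy would have caught your degraded state weeks sooner. A minimal sketch, assuming you have some way to query array health from a script; check_array_healthy() is a hypothetical stand-in (I don't know of a scriptable status interface for the RocketRaid web UI), and the mail host and addresses are placeholders:

```python
#!/usr/bin/env python
"""Toy RAID watchdog: run it from cron and get mail when the array
degrades. check_array_healthy() is a hypothetical stand-in -- wire it
to whatever status source your controller actually exposes."""

import smtplib
from email.mime.text import MIMEText

SMTP_HOST = "localhost"                    # placeholder mail relay
ALERT_TO = "you@example.com"               # placeholder address
ALERT_FROM = "raid-watchdog@example.com"   # placeholder sender

def check_array_healthy():
    """Hypothetical: return True if all arrays are OK, False if any
    array is degraded/critical. Replace with a real status query."""
    raise NotImplementedError("wire this to your controller's status")

def send_alert(body):
    msg = MIMEText(body)
    msg["Subject"] = "RAID status alert"
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.sendmail(ALERT_FROM, [ALERT_TO], msg.as_string())

if __name__ == "__main__":
    try:
        healthy = check_array_healthy()
    except Exception as exc:
        send_alert("Watchdog couldn't read array status: %s" % exc)
    else:
        if not healthy:
            send_alert("Array is degraded/critical -- check it NOW.")
```

Run it from cron every few minutes; the point is just that a degraded array should come find you, instead of waiting for you to poke at a web page.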

Still, it can have a place, although from a reliability standpoint you probably would have been better off using RAID 10 (RAID 1+0, a stripe across mirrored pairs). You would need 350 GB drives (or larger) to get your 700 GB effective, however, and the MTBF for the array as a whole would still be less than the MTBF of any one component drive.
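To put toy numbers on that comparison, here's a sketch assuming independent drive failures and an invented per-drive failure probability (real failures are correlated, as your experience shows, so treat it as illustrative only):

```python
# Toy reliability comparison: 4-drive RAID 5 vs. 4-drive RAID 10,
# assuming independent failures with probability p per drive over a
# fixed window. p = 0.05 is an invented, illustrative number.

p = 0.05        # assumed per-drive failure probability over the window
q = 1 - p

# RAID 5 (4 drives): data survives zero or exactly one drive failure.
raid5 = q**4 + 4 * p * q**3

# RAID 10 (2 mirrored pairs): data survives unless both halves of a
# pair die.
raid10 = (1 - p**2) ** 2

print(f"single drive:      {q:.4f}")       # 0.9500
print(f"RAID 5 (4 drives): {raid5:.4f}")   # 0.9860
print(f"RAID 10 (2x2):     {raid10:.4f}")  # 0.9950
```

Both layouts protect the data better than a bare drive under this model, and RAID 10 beats RAID 5, but note the distinction: the array still suffers drive-failure events more often than one drive alone would, which is the MTBF point above.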

And after many years of working with large storage systems, I'll venture to say that generally, you have to do a lot more than just bolt together several drives and a RAID controller to get good results from a RAID. Power supply and thermal considerations are basic: typically you'll need 5 to 7 times the airflow of a standard single-drive PC case to maintain acceptable internal peak temperatures, with power supply capability to match. Inter-drive shielding matters, and drive mounting to minimize vibration and torque effects is important. Cabling, even in SATA- or SAS-based arrays, is important.

Really, reliable RAID is an engineering and packaging project, even in this age, and it's very hard to recover the cost and effort needed to build reliable RAID arrays in small unit quantities. If you don't need the streaming throughput that higher RAID levels can provide (and RAID 5 doesn't suggest you're doing RAID for performance), you'd be operationally better off doing some kind of LVM, plus a good automated journaled-filesystem backup strategy to nearline disk media, than with any kind of RAID, in terms of both reliability and disk performance.
posted by paulsc at 9:15 PM on April 19, 2007


Using 4 drives of the same make and from the same manufacturing batch is not the best idea. There is a not-so-small possibility that the lifetime/MTBF of the individual drives will be nearly identical, if the manufacturer applies high-quality production standards (as most do).

Taking drives from different batches improves the situation, because variation between batches increases the chance that not all drives will die at the same time for the same reason. The more batches, the better!

With drives from different manufacturers, the chance of fatal errors occurring at the same time is even lower, provided that no single external event is causing errors on all the drives (e.g., a voltage surge from a thunderstorm, a fire, etc.).
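Here is a toy model of the effect, with invented numbers: besides independent failures, assume each manufacturing batch carries some small chance of a shared defect that kills every drive from that batch at once.

```python
# Toy model of batch-correlated failure (all numbers invented).
# Each drive fails independently with probability p_ind; each *batch*
# additionally has probability p_batch of a shared defect that kills
# every drive from that batch together.

p_ind = 0.02    # assumed independent failure probability per drive
p_batch = 0.01  # assumed probability of a fatal shared batch defect

def p_all_four_die(num_batches):
    """P(all 4 drives die), drives spread evenly over num_batches
    batches (1, 2, or 4)."""
    per_batch = 4 // num_batches
    # A batch's drives all die if the batch defect hits, or if each
    # drive in it happens to fail independently anyway.
    p_group = p_batch + (1 - p_batch) * p_ind**per_batch
    return p_group**num_batches

for n in (1, 2, 4):
    print(f"{n} batch(es): P(all 4 die) = {p_all_four_die(n):.2e}")
# 1 batch(es): P(all 4 die) = 1.00e-02
# 2 batch(es): P(all 4 die) = 1.08e-04
# 4 batch(es): P(all 4 die) = 7.89e-07
```

In this toy model, spreading the four drives over more batches lowers the chance of losing them all at once by several orders of magnitude.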

A colleague of mine worked at a large computer company, and they took care of this issue when setting up RAID systems for sensitive environments.
posted by cwittmann at 1:30 PM on April 20, 2007


It definitely sounds like either a power surge or a failed RAID controller (which itself could have been caused by a power surge).

There is no easy way to fix this yourself; more than likely, if you want the data, you will have to call a data recovery company.

Most of them charge a fee just to look at your RAID, usually in the hundreds of dollars. I have personally used a company that does NOT charge that fee: ReWave.
You can look at their info at: http://www.rewave.com/raid-data-recovery.htm

Since there is a possibility that 3 of the 4 drives are fried, it wouldn't be wise to spend a lot of money on evaluating it unless the data is truly worth it.
posted by tommoss87 at 7:25 AM on November 13, 2007

