Why, oh why, RAID5 doest thou split and die?
April 20, 2008 2:25 PM

Help me make sense of my wonky RAID-5 failure.

OK, so I have a RAID-5 array composed of six 320GB disks (so about 1.5TB of usable space) on an AEC-6897 controller (a controller I've always thought felt a bit squirrelly).

Today I discovered that the array has failed. I figured a drive must have died and I'd need to replace it. Not a huge deal, I assumed (and I thought I heard a drive make that *click* *click* noise a few times last night, so I figured this might be coming).

Well, the strange thing is that when I pulled up my RAID utility to see which drive needed replacing, instead of showing my six-disk array with one failed disk, it showed two separate arrays: one with the first four disks, and one with the other two (both arrays, naturally, marked as failed). But none of the individual drives appear to have failed.

I'm not sure how well I described my problem, so here are some screen grabs of what I'm dealing with.

Any ideas how I should proceed? Is this a controller failure? If so, can I just buy a duplicate controller, plug everything back in in the same order, and expect it to work? Or is it a lost cause? Any other ideas?

Thanks much,
posted by Jezztek to Computers & Internet (5 answers total) 1 user marked this as a favorite
This looks like a driver fubar. If any data has been written to the now four-disk stripeset, you're probably hosed.

If not:

Reboot and get into the RAID BIOS. What you want to do is rebuild the array as it was, but *NOT* initialize it. Ideally, you'd be able to read the RAID config off the drives (AMR RAID controllers do this).

If you initialize it, you'll lose all the data. If you're lucky, it's a driver-only fubar and that setup hasn't been written to the controller, so a reboot will come up with a six-drive array.

One other possibility: two of your drives did fail and dropped offline for a bit. If so, you may have lost data already, but hopefully a reconstruction will get off what you can.
posted by eriko at 4:27 PM on April 20, 2008

One of the drives on channel 1 has most likely failed and knocked out the second drive on that channel with it - it's a standard problem with IDE RAID systems, and comes from the shared cable with the master/slave setup. RAID 5 can only cope with a single disk failure, regardless of the total number of disks; losing two kills the array.
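(To see why one failure is recoverable but two are fatal: RAID 5 parity is just the XOR of the data blocks in each stripe, so any one missing block can be recomputed from the rest, but with two missing blocks the XOR mixes them together and neither can be separated out. A toy sketch in Python, with made-up four-byte blocks standing in for real stripe units:)

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Hypothetical stripe across five data disks; parity goes on the sixth.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]
parity = xor_blocks(data)

# One disk fails: its block is recoverable by XORing the survivors + parity.
lost = data[2]
survivors = data[:2] + data[3:]
recovered = xor_blocks(survivors + [parity])
assert recovered == lost

# Two disks fail: XORing the remaining blocks + parity yields the two lost
# blocks XORed together, which cannot be split apart -- the array is gone.
```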

What looks to have happened is that the array has become desynced as a result, and the two drives now carry a different timestamp than the other four. It's possible the drive fault was temporary and the drive is now operational again, but odd drive noises are never good.

If you're VERY lucky, identifying and removing the failed drive of the two will let the fifth drive back into the fold after a reboot, and the array will run in degraded mode so you can back it up, or replace the failed drive and rebuild. More likely, you'll have to repair the array and suffer partial or complete data loss, depending on the quality of the rebuild tools.

As eriko says, reinitialising will wipe the array for sure.

Alas, RAID is no substitute for good backups, especially RAID 5 on IDE.
posted by ArkhanJG at 5:28 PM on April 20, 2008 [1 favorite]

Ah, slight mistake - that should be channel 3, above.
posted by ArkhanJG at 5:37 PM on April 20, 2008

OK, so I tried removing each drive to see which one had failed, but removing either one didn't change the fact that the remaining drive was still recognized as its own array, apart from the other four.

So I tried eriko's suggestion and rebuilt the array as it was. Alas, Windows doesn't seem to want to recognize it (it just keeps asking me to initialize it). I tried loading up some data recovery software and pointing it at the uninitialized disk, and it does seem to be recognizing files (unfortunately, they are just random unlabeled files).
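(Those "random unlabeled files" are the signature of carving-based recovery: filenames live in the filesystem metadata, which the tool can't read, so it just scans the raw bytes for known file headers and saves each hit under a made-up name. A toy illustration of the idea for JPEGs - real carvers like PhotoRec are far more careful than this:)

```python
def carve_jpegs(raw: bytes):
    """Scan a raw byte stream for JPEG start/end markers, yielding candidates.

    Filenames are gone -- they lived in the filesystem tables -- so each
    hit can only be saved under a made-up name like recovered_0001.jpg.
    """
    SOI, EOI = b"\xff\xd8\xff", b"\xff\xd9"   # JPEG start / end markers
    pos = 0
    while (start := raw.find(SOI, pos)) != -1:
        end = raw.find(EOI, start)
        if end == -1:
            break
        yield raw[start:end + 2]
        pos = end + 2

# Toy "disk image": filler bytes with one JPEG-shaped blob inside.
image = b"\x00" * 16 + b"\xff\xd8\xff\xe0payload\xff\xd9" + b"\x00" * 16
found = list(carve_jpegs(image))
```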

So my next questions are: Are there any tricks to getting Windows to recognize the drive without reinitializing it? Or, failing that, what data recovery software would you recommend in this situation?

Thanks again,
posted by Jezztek at 9:11 PM on April 20, 2008

I'm not a huge RAID expert, but generally there is configuration data stored on the drives themselves - Linux software RAID calls it a superblock, I think. I would imagine this data became unreadable on the two oddball drives. You might have some success opening a disk editor and inspecting that data.
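(The superblock idea is real for Linux md, which stamps each member drive with a block containing a magic number, array UUID, and the member's role; a hardware controller like the AEC-6897 uses its own undocumented metadata format, so the specifics below don't apply to it directly. But the scan-for-a-magic-value approach with a disk editor is the same. A sketch of hunting for the Linux md magic number, 0xa92b4efc, in a raw image - checking both byte orders, since old v0.90 superblocks were written in host byte order:)

```python
import struct

MD_SB_MAGIC = 0xA92B4EFC  # Linux md superblock magic number

def find_md_magic(raw: bytes):
    """Return offsets where the md magic appears, in either byte order.

    Purely illustrative: this only finds Linux-md-style superblocks,
    not whatever proprietary metadata the AEC-6897 writes.
    """
    hits = []
    for fmt in ("<I", ">I"):          # little-endian, then big-endian
        needle = struct.pack(fmt, MD_SB_MAGIC)
        pos = 0
        while (pos := raw.find(needle, pos)) != -1:
            hits.append(pos)
            pos += 1
    return sorted(hits)

# Toy image with the magic planted at offset 4096.
image = bytearray(8192)
image[4096:4100] = struct.pack("<I", MD_SB_MAGIC)
offsets = find_md_magic(bytes(image))
```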

Another thing to try would be to get the error-checking software for the brand(s) of disk you're running and see if any funny errors come up. I don't know what this would do to your stripes if you allowed it to correct any errors, however. Just an idea.

One recommendation would be to determine the WORST drive and remove it from the picture. Since RAID 5 is all about recovering from a single drive failure, doing so might increase the number of recoverable files.

And a final bit of good, tested advice: keep your files defragmented. I have had to recover more than one drive where all the contiguous files were easily recoverable, but the ones written or changed after the last defragmentation were hosed.
posted by gjc at 8:50 AM on April 21, 2008
