Did chkdsk eat my files?
June 18, 2007 11:22 PM

So my server reset while my RAID array was synchronizing... now my files are corrupt? :-(

From what I can gather the array (500 GB, hosts my media files) was synchronizing when the system reset. When the system came back online Windows reported that several directories were "corrupt" via some status pop-ups.

However, I didn't immediately run chkdsk; I waited for the array to finish synchronizing. The next day I restarted the server and chkdsk ran. In doing so it posted a lot of messages along the lines of restoring missing file links, parent links, etc.

After chkdsk the system booted fine and happily all of my files appeared intact exactly where they should be.

Unfortunately most of them appear corrupt.

Specifically my mp3s were located on this array, and while the files (along with the directory structure, file names, and sizes) appear fine, they won't play. Ditto for my DVD rips, and several archived programs won't install from their .exe's either.

Are my files completely hosed? Or am I missing something?

The system is running Windows XP SP2. The array is RAID 5 and set as NTFS. I have access to winternals and other diagnostic software.

Thanks for any advice.
posted by wfrgms to Computers & Internet (8 answers total)
 
You don't give much information about your hardware, but generally, you should never run chkdsk on a RAID array unless the low-level parity is perfect. Chkdsk is not RAID aware, and will simply try to analyse and fix the NTFS tree as if dealing with a normal disk. In doing so, it can write over information in a degraded RAID array that might have been recoverable at a lower level. So, I'm not at all sanguine about your chance for recovery at this point, but still, for the future knowledge of others who may find this thread, let us know a bit more about your server system:

Do you have a RAID 5 controller card, or is your array software only? If you do have a card, what kind? If you have a controller card, is it equipped with cache? If equipped with cache, is it battery backed (if so, no problem, as long as you haven't been down longer than the battery supports cache)? Have you tried letting the controller card rebuild the array (may take a few hours, depending on speed of the card, stripe size, and drives)?

If you have software RAID 5, or are using the Intel motherboard chipset RAID, or similar, getting Windows to rebuild a corrupted array correctly will depend entirely on things like whether you had write-back disk caching enabled (hopefully not), and whether the system disk cache was dirty at reset. If the system cache wasn't dirty, and you didn't have write-back caching enabled, you should have been able to rebuild the array successfully, unless there were truly disk hardware problems. If the cache was dirty, however, and your system problem happened in the middle of a write update, you wouldn't necessarily be guaranteed a consistent RAID state. For this reason, any machine running RAID needs to be on a UPS, with enough capacity to ensure a smooth shutdown.
posted by paulsc at 11:58 PM on June 18, 2007
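
For readers unfamiliar with why an interrupted write leaves a RAID 5 stripe inconsistent, here is a minimal sketch. It assumes a hypothetical 3-disk layout with one stripe of two 4-byte data blocks plus XOR parity; the block contents and the helper name xor_blocks are made up for illustration and have nothing to do with any particular controller's firmware.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 3-disk RAID 5: two data blocks plus parity.
d0 = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks(d0, d1)

# Consistent stripe: any one lost block can be rebuilt from the other two.
assert xor_blocks(d1, parity) == d0

# Interrupted update: new data reaches disk 0, but the reset happens
# before the matching parity write lands.
d0_new = b"CCCC"
stale_parity = parity  # never updated

# The data blocks no longer agree with the parity block.
stripe_consistent = xor_blocks(d0_new, d1) == stale_parity

# Rebuilding d1 from the new data and the stale parity does not give d1 back.
rebuilt_d1 = xor_blocks(d0_new, stale_parity)
print(stripe_consistent, rebuilt_d1 == d1)
```

Nothing in the mismatched stripe says which of the three blocks is the wrong one, which is why the array cannot simply repair itself after the fact.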


you should never run chkdsk on a RAID array

I assumed as much, but in this circumstance the chkdsk ran automatically on restart and I didn't have enough sense to stop it once I noticed it running.

Do you have a RAID 5 controller card

Yes, it's a Promise SuperTrak SX6000 with 128 MB of cache.

I have not tried rebuilding the array. Do you think it's worthwhile at this point?
posted by wfrgms at 12:08 AM on June 19, 2007


Ah, cache without battery is risky. If your PCI bus gets a reset with the controller cache dirty, the RAID state is more likely than not to be inconsistent. Nothing you can do at that point, as the bus reset typically dumps controller cache.

You really need to look at a high quality "on line" type UPS system for the whole machine, if you're going to continue to use this card. Even then, you'll probably get some RAID issues now and again, just due to cosmic gamma rays and Murphy's law. So, you'll have to have a consistent backup strategy and devices, and the means of doing bare metal restores after re-initializing your RAID, following a corrupting event. But an "on line" type of UPS may go a long way to minimizing these issues, and is probably worth the money.

In future, if you're committed to RAID, consider cards with battery options, that allow dirty cache to survive system shutdown and restart. That way, you can even change out bad disks in a power down state, and be assured of bit perfect rebuild, including unflushed cache contents, error free.

I doubt a RAID re-build at this point is going to do you any good. On the other hand, it isn't going to screw things up any worse than they are now, so what have you got to lose? But I really think you're basically looking at a restore from backup, to recover your files.

Sorry. Don't mean to be pessimistic.
posted by paulsc at 12:27 AM on June 19, 2007


Another thing to do if using a controller card without battery backup for cache is to be sure your disk settings in Windows and in the controller hardware are set to write-through, not write-back, disk caching. You sacrifice performance with this, but you greatly reduce the risk of disk problems, particularly with RAID arrays, because you're basically never allowing Windows to run with a dirty cache. Everything makes it to disk, or Windows doesn't continue with disk operations.

Obviously, in write heavy situations, the performance penalty, particularly with slower disks, is huge, but in a file server situation where you're just serving media files mainly, not so much. And the lower performance is definitely worth avoiding the hassle of restoring large disk partitions when they're corrupted.
posted by paulsc at 12:36 AM on June 19, 2007
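
To make the caching trade-off concrete, here is a toy model, not a real driver. It assumes the usual meaning of the terms: with write-through, a write is on disk before it is acknowledged; with write-back, it is acknowledged while still sitting in volatile cache. The class CachedDisk and the function surviving_blocks are hypothetical names used only for this sketch.

```python
class CachedDisk:
    """Toy model of a volatile cache in front of a disk."""

    def __init__(self, write_back: bool):
        self.write_back = write_back
        self.disk = {}   # blocks that survive a reset
        self.dirty = {}  # blocks held only in volatile cache

    def write(self, block: int, data: bytes) -> None:
        if self.write_back:
            self.dirty[block] = data  # acknowledged before reaching disk
        else:
            self.disk[block] = data   # write-through: on disk before returning

    def flush(self) -> None:
        """Push all dirty blocks to disk, as a clean shutdown would."""
        self.disk.update(self.dirty)
        self.dirty.clear()

    def reset(self) -> None:
        """Simulate a reset with no battery backup: the cache is dumped."""
        self.dirty.clear()


def surviving_blocks(write_back: bool) -> dict:
    """Write two blocks, reset before any flush, and return what is on disk."""
    dev = CachedDisk(write_back)
    dev.write(0, b"part 1")
    dev.write(1, b"part 2")
    dev.reset()
    return dev.disk


lost = surviving_blocks(write_back=True)
kept = surviving_blocks(write_back=False)
print(len(lost), len(kept))
```

A battery-backed cache would correspond to reset() leaving the dirty blocks in place so that a later flush() can still write them out.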


Power failures + complex file systems + complex disk arrangements = now you know what backups are for. I think you're hosed. Paulsc's advice is good.
posted by flabdablet at 1:16 AM on June 19, 2007


Fundamental lesson here: RAID is not a backup. RAID is designed to prevent downtime from the loss of a disk. It offers a little more protection than just a naked drive, but it's not a substitute for a backup and should never be treated that way.
posted by Malor at 5:16 AM on June 19, 2007


Thanks for all the comments guys. I wanted to at least ask before I started digging through my backups. Luckily I don't think I lost too much and none of it was irreplaceable.

I'm looking to upgrade the array later this year with a new card and drives, so I'll certainly shop for one with a battery backup, and yeah, I should invest in a UPS.

Regards,
posted by wfrgms at 4:51 PM on June 19, 2007


Oh, by the way, it occurred to me... chkdsk shouldn't cause trouble even if the array is resynchronizing. Why? Because any decent controller will resync each cylinder that chkdsk hits before giving it any data. Chkdsk will never see any errors that were correctable.

What probably killed it was the power failure during the resync... chkdsk probably wasn't involved.

UPSes are really cheap these days. You can get used APC 1400s with fresh batteries for about $150-$200. You'll have to spend about $100 on batteries every four years or so. They're built like tanks and last forever. Superb products.
posted by Malor at 8:36 PM on June 19, 2007

