How can I accurately determine which drive is actually failing?
October 31, 2007 4:59 PM

One of my hard drives threw a SMART error last night, but the symptoms are inconsistent with the self-diagnosis. What is the best way to do further testing?

Background: My computer has worked fine for years. I have not installed any hardware in just as long, and the last program I installed a few weeks ago has worked fine ever since.

AMD Athlon 3000+
2 HDDs of different sizes (one Seagate, one Maxtor)
GeForce 6600 video card (something like that)

Symptoms: I was playing Oblivion last night and after about half an hour I started experiencing very erratic pauses. It wasn't the regular bouts of slowdown I experience as the result of running this game on an underpowered PC, where the sound keeps playing and the video freezes temporarily-- in this case, both sound and picture froze for 3 seconds, then continued. The problem extended into Windows as well. I restarted the computer and it worked fine for another half hour, then the problem repeated. Eventually the game froze altogether, so I did a hard reset.

Prognosis: Here's where things get weird. As mentioned above, I have 2 drives. Oblivion is installed on the master drive. When the computer restarts, BIOS informs me that the secondary slave drive is failing. I shut the computer off immediately and have not powered on since. Perhaps it doesn't matter, but it seems odd to me that a drive that shouldn't even be actively used is causing all these hangups. The drive is not heavily used otherwise, and I believe it is even the newer of the two.

Both drives are on the same IDE cable; one is set to master, one is set to slave (as should be). Yet the slave drive supposedly threw the error. I have not changed the jumpers since installing the drives.

Stranger still is the fact that I've been running HDD Health and all this time that second drive has had a TEC well into the future (albeit with a 50% accuracy rate).

Inquiry: I've been burned before by ignoring SMART warnings, so I'm not taking any chances this time. I intend to back up the most important data from both drives to an external drive and relegate the problematic drive to one of my non-critical lab machines.

But the question remains...

How can I accurately determine which of the drives is actually failing? I don't entirely trust SMART's diagnosis. Are there any utilities I can run to do more in-depth testing than HDD Health?

posted by Ziggy Zaga to Computers & Internet (9 answers total)
Realize that drives frequently fail without producing any SMART errors. SMART cannot predict a failure ahead of time; it can only report read errors and other statistics.
posted by SirStan at 5:26 PM on October 31, 2007

SMART is really unreliable, and non-standard between manufacturers in really scary ways.

Can I suggest grabbing SpinRite? I've found it to be a really great disk diagnosis / recovery tool when I've had bad problems before.

Probably worth the $90, especially if you've got a lab full of machines that may suffer occasional failures.
posted by jenkinsEar at 5:45 PM on October 31, 2007

Really, the error in HDD Health longevity calculations is too high to take them seriously. I really wouldn't put any stock in it.
Also, drives do fail randomly (or at least seemingly so). Out of all the hard drives I've owned, the two that have failed were both within a year of their manufacturing date.
If I were you, I would remove the second drive immediately, and then see if the pauses continue. Either way, backing up sounds like a great idea.
posted by Ctrl_Alt_ep at 6:31 PM on October 31, 2007

I would recommend against SpinRite. Many people, myself included, don't think Steve Gibson (GRC) knows what he is talking about.

Jenkins is right that SMART errors are difficult to understand. This is due to the lack of industry-standard failure parameters. I think that at this point you have been given a warning you should heed. Buy a new disk (they are very cheap: 500GB / $100) before you have a catastrophe.
posted by ydnagaj at 6:32 PM on October 31, 2007

Both drives are on the same IDE cable; one is set to master, one is set to slave (as should be). Yet the slave drive supposedly threw the error.

They share the cable, and Windows polls the second drive at regular intervals. If that second drive is going through a failure mode, it can lock up the IDE bus for a bit.

Absence of a SMART failure doesn't mean a drive is good, but PRESENCE of one is a near-certain sign of imminent failure. Trust what it's telling you and pull that drive out of service.

Spinrite might be able to resurrect it, but I'd strongly suggest backing it up first.
posted by Malor at 6:45 PM on October 31, 2007

I expect you'll find that there's some file in use on the second drive that Windows is only reading (not writing to), and a bad sector has formed inside it. This is not necessarily an indication that the whole drive is about to collapse, but you should react as if it is. Your plan about backing up and relegating that drive to non-critical use is a good one.

If you have access to a tool that shows you uncooked SMART data, look for sectors pending reallocation on the troublesome drive. I expect you'll find at least one.
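One tool that shows the raw values is smartctl from smartmontools (`smartctl -A <device>` prints the attribute table). Here is a minimal Python sketch that pulls the raw counts out of that table. The sample text is made up for illustration, not output from the poster's drive, and the attribute IDs used (5, 197, 198) are the conventional ones; vendors are free to differ, so treat the mapping as an assumption.

```python
# Illustrative sample in the column layout that `smartctl -A` prints.
# These numbers are invented for the example.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   038   051   000    Old_age   Always       -       38
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
"""

# Conventional attribute IDs (assumption: the drive follows the common numbering).
WATCHED = {5: "reallocated", 197: "pending", 198: "offline_uncorrectable"}


def parse_raw_values(text):
    """Return {name: raw_value} for the watched attributes found in the table."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED:
            continue
        try:
            raw[WATCHED[attr_id]] = int(fields[9])
        except ValueError:
            # Some vendors put non-numeric data in the raw column; skip those.
            continue
    return raw


counts = parse_raw_values(SAMPLE_OUTPUT)
print(counts)
```

A non-zero "pending" count on the troublesome drive is what you would be looking for.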

As soon as new data get written to a sector that the drive has marked for reallocation, the drive will sweep that sector under an internal rug, and you will never see it again; it will appear to have been magically fixed. This is probably what happens to most bad sectors quite spontaneously, but if the sector in question is only ever being read, never rewritten, it just stays bad and causes long delays every time it's accessed.
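To make that concrete, here is a toy model in Python of the behaviour just described. It is only a sketch of the idea: the class and method names are hypothetical, and real firmware is more involved (it may, for instance, re-test a pending sector on write and keep it in place if the write verifies).

```python
class ToyDrive:
    """Toy model: a pending sector fails reads until it is rewritten."""

    def __init__(self, sectors, spares):
        self.data = [b""] * sectors
        self.pending = set()       # sectors the drive could not read
        self.reallocated = set()   # sectors remapped to the spare pool
        self.spares_left = spares

    def mark_unreadable(self, lba):
        """Simulate a sector going bad; the old contents are lost."""
        self.pending.add(lba)

    def read(self, lba):
        if lba in self.pending:
            # On a real drive this is where the long retry delay happens.
            raise OSError(f"unrecoverable read error at sector {lba}")
        return self.data[lba]

    def write(self, lba, payload):
        if lba in self.pending:
            if self.spares_left == 0:
                raise OSError("spare sector pool exhausted")
            # The write supplies fresh data, so the drive can remap quietly.
            self.spares_left -= 1
            self.reallocated.add(lba)
            self.pending.discard(lba)
        self.data[lba] = payload


drive = ToyDrive(sectors=8, spares=1)
drive.write(3, b"old")
drive.mark_unreadable(3)   # reads of sector 3 now fail every time
drive.write(3, b"new")     # rewriting it triggers the reallocation
print(drive.read(3), drive.pending, drive.reallocated)
```

In this model a sector that is only ever read stays in `pending` forever, which matches the long delays on every access.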

If you do a full (not quick) NTFS reformat when you put the drive in your spare machine, it will rewrite all sectors including the faulty one, and the fault will most likely go away.

If you want to be a bit less brutal than that, and you're feeling adventurous, and everything you care about is backed up, and you have another spare drive to play with, read this older thread.
posted by flabdablet at 6:52 PM on October 31, 2007

Absence of a SMART failure doesn't mean a drive is good, but PRESENCE of one is a near-certain sign of imminent failure.

In my experience, presence of one particular class of SMART error (sectors pending reallocation) is not an indication that the drive is about to melt down, unless there is a large number of sectors already reallocated. If you rewrite the pending sector(s) and let the drive reallocate them, the most likely result is that it will run just fine for a few more years. Were this not so, Steve Gibson would simply not be able to keep selling SpinRite, since that's the mechanism his product relies completely upon.

What SpinRite does, for each failed sector, is just read and read and read and read it until it has enough similar-looking copies to make an educated guess about what the actual data really should have been, then write that back to the same sector to make the drive reallocate it. All of paulsc's warnings about hammering drives on their last legs using ddrescue apply even more strongly to SpinRite. If the SMART data shows that the drive's spare sector pool is exhausted, DO NOT expect SpinRite to do anything except break it worse.
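The "educated guess" step can be pictured as a per-byte majority vote over many read attempts. The Python sketch below illustrates that idea only; it is not SpinRite's actual algorithm, the function name is made up, and it assumes each attempt somehow yields a full-length (possibly corrupted) copy of the sector, which a drive does not normally hand over for a failed read.

```python
from collections import Counter


def majority_vote(reads):
    """Combine equal-length byte strings by taking the most common byte at each offset."""
    if not reads:
        raise ValueError("need at least one read attempt")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("all read attempts must be the same length")
    recovered = bytearray()
    # zip(*reads) yields, for each offset, the byte seen there in every attempt.
    for candidates in zip(*reads):
        value, _count = Counter(candidates).most_common(1)[0]
        recovered.append(value)
    return bytes(recovered)


# Four hypothetical attempts at the same sector, two of them with a flipped byte.
attempts = [b"HELLO", b"HELXO", b"HELLO", b"JELLO"]
print(majority_vote(attempts))  # b'HELLO'
```

Every extra attempt is another pass over a weak spot on the platter, which is why the hammering warning applies.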

There are free alternatives to SpinRite (notably Gnu ddrescue) that can be persuaded to do pretty much what SpinRite does and lots more, but with less hype, more control and no slick UI.

Trust what it's telling you and pull that drive out of service.

If you care about your data, best practice is to view any threat to its integrity as an opportunity to duplicate it somewhere safer. And new drives really are crazy, crazy cheap; hard disk storage now costs less per gigabyte than DVD-ROM.
posted by flabdablet at 7:15 PM on October 31, 2007

I've used SpinRite and it repaired a good handful of errors on an IDE 200G drive I had. It's still in use, but I don't keep anything important on it anymore.

Your lockups may be due to a bad block in the pagefile (Windows's on-disk swap file). If you are letting Windows manage it, it often puts one on each drive it sees. I used to keep the pagefile on secondary drives on purpose.

You could try letting chkdsk repair it too, but I found SpinRite to work well on the drive above and to fail on an ancient 6G laptop drive that was clunking and scraping its way to the grave. It got partway through recovery (after 14 hours) and couldn't continue. Too much drive damage, I suspect.
posted by clanger at 9:13 PM on October 31, 2007

All good answers, but Malor wins the cake.

During the course of my backups, Windows eventually stopped being able to boot anymore, leading me to think that it really was the PRIMARY drive that was failing since Windows was installed there. Then I re-read Malor's comment.

Both disks shared an IDE bus, and since one drive was failing, it was tying up the other. I disconnected the SECONDARY drive (at SMART's suggestion) and Windows booted without a problem. So I suppose it really was the secondary causing the problems.

I bummed a copy of SpinRite off a co-worker to see if I can give it a second wind, but the drive's life in a production machine has come to an end.

(The cake is a lie, btw.)
posted by Ziggy Zaga at 4:01 PM on November 2, 2007
