All of this will be lost .... Windows XP pining for the fjords?
December 8, 2011 7:54 PM

My installation of Windows XP Pro (SP3) seems to have developed intermittent amnesia. I've noticed a couple of times in the past week that files which I edited and saved under new names just -- disappeared. (Details inside). Today, after restarting the machine, I found my desktop wallpaper and Quick Launch icons had reverted to what I had several months ago, and my Recent Documents lists in MS and Adobe products were showing material from July 2011 and earlier. Plus, my iTunes library had suddenly been unable to find about 10% of my songs, although I have been able to manually locate them and make iTunes recognize a few again. These "missing" songs are both old and new: I don't see any pattern in time re what has gone missing. I'm assuming the worst -- XP death throes in progress! -- and am updating my data backups NOW. All my applications seem fine so far: I'm not missing anything that was recently installed. No other data seems to be actually missing. While I back up, is there anything else I should investigate or hypothesize?

Disappeared files in the past week: (All data is saved on a separate physical drive. C drive is only for the o/s.)

1) Word 2010 was used to save a docx file under a new name. Edited content extensively, saved, and emailed. Recipients said file was old version, and checking my hard drive, that was true: new name, all old content.

2) Three Adobe Captivate files were saved under new names and edited on Dec 2. On December 5, they didn't show up in the app's recent files list, and they were nowhere to be found on the data drive.

Oh, one more thing [/Columbo]: I have Norton Ghost 15 installed. I backed up my XP install last year (Dec 10 and 13 2010, with two recovery points set), but I haven't had the drive it's backed up to connected recently, so I'm used to seeing the Ghost icon in my taskbar with a red x through it. Today, after the restart, the Ghost icon appeared without the x. When I opened it earlier today, threat level was at 2 (Medium), but right now it's at threat level 1 -- Normal. Event Log is empty.
posted by maudlin to Computers & Internet (25 answers total)
 
Best answer: Is it NTFS or Fat32?

Have you run chkdsk.exe?

Since you are seeing whole files disappear, as opposed to corrupted files, my intuition is that the directory table is corrupted, not that individual clusters are corrupted, although that may be the case as well.

Since this is ongoing, and not an example of one case of lost or corrupted files, I think the hard drive may be physically failing.

Can't hurt to run chkdsk though.
posted by Ad hominem at 8:42 PM on December 8, 2011


Response by poster: It's NTFS.

I'll run chkdsk, but I'm trying to understand how both my C drive and the data drive could be affected.

I'm seeing a renamed file (the Word file) in the right place on my K drive, but with the wrong content. And I'm seeing my windows install (on the C drive) go back to old desktop configurations, and the recent files list (on applications installed on my C drive, referring to files stored on my K drive) rolled back to July. Which one drive could be causing all this to happen? Or by some odd coincidence, are both failing at the same time?
posted by maudlin at 8:58 PM on December 8, 2011


First guess is that the hard drive is going bad in an odd way.

Random stab is that you have clock problems. Maybe change the clock battery?
posted by gjc at 8:58 PM on December 8, 2011


Today, after the restart, the Ghost icon appeared without the x. When I opened it earlier today, threat level was at 2 (Medium), but right now it's at threat level 1 -- Normal.

WTF is a "threat level" in the context of a disk imaging tool?

Is it conceivable that Ghost could be "helpfully" restoring your drives to an earlier state after misdetecting some form of corruption?
posted by flabdablet at 9:13 PM on December 8, 2011


Best answer: Another thing worth checking for is bad RAM. Unzip this, burn a CD-ROM, boot from it, and leave it running overnight.
posted by flabdablet at 9:16 PM on December 8, 2011


Best answer: Good point about both disks losing files, I missed that.

I don't want to grasp at straws and suggest that the hard drive controller is failing, but it is possible.

I had chalked up the MRU list reverting to the fact that the app would check to see if the files were still there before displaying them in the MRU.

A local profile, your desktop config, is likewise a file or files. I guessed Windows reverted to an old copy or created a new one.

The most likely scenario aside from a failing hard drive is that you have some sort of malware.
posted by Ad hominem at 9:18 PM on December 8, 2011 [1 favorite]


Response by poster: Forgot to mention: my original windows profile was corrupted July 22. I was able to create a new profile and move files back then -- which -- d'oh -- may have been the only profile left on the machine today after restarting. I never deleted the old profile, and I can't remember if I saw it listed when I restarted today, or if I even had to click a profile to start.

I'm running on fumes and haven't been taking good care of my computer, obviously.

RAM and clock and malware are maybes. Ran chkdsk as diagnostic 3 phases only -- no errors reported. C does need to be defragged though.

Yeah, looks as if I'm back in old profile, not new. Bloody hell.

I'm making some tea now and will:

1. Finish data backup to external drive
2. defrag
3. repent

Feeling pretty dumb now. That corrupted profile should have been a clue. I save filezilla, firefox and thunderbird profiles to the K drive, so that's why I didn't notice I was on old profile.

I can't see Ghost reverting to a July backup if my last was December and the hard drive with backup wasn't connected.

Thanks for the ideas. Any further advice is still welcome.
posted by maudlin at 9:44 PM on December 8, 2011


Best answer: Next time you reinstall Windows, use nLite to build yourself a new setup disc for the job. Apart from being able to slipstream service packs and Windows updates, this will let you tweak a bunch of system defaults. One of the handy ones is the user profile root folder: changing this from %systemdrive%\Documents and Settings to K:\Documents and Settings would mean you wouldn't have to do anything the least bit clever to make your Windows partition into something it's totally safe to re-image at any time.

Also: Windows works out which profile to use at logon by consulting HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList\(user SID) which, as a subkey of HKEY_LOCAL_MACHINE, is in one of its central registry hive files. If something bad happens to a hive file, Windows will do various things to try to recover. It's entirely plausible that the SOFTWARE hive you're using right now looks more like it did in early July than in early December.
posted by flabdablet at 2:43 AM on December 9, 2011 [1 favorite]


Best answer: What do you see under this folder?

C:\Documents and Settings

If you see a slew of folders, it could be that your working profile is somehow becoming corrupt from time to time and Windows is creating a new one from a default (which may be your older profile depending on how it was restored last time).

Also, look at your event logs for other signs of problems:

- Right click on My Computer and select "Manage"
- Navigate to the System event log, note any red or yellow marked issues pertaining to your profile
- Also check the Applications event log for issues pertaining to Ghost.


You may want to temporarily disable Ghost to see if that makes a difference....but it will be hard to tell what exactly is going on until some log detail is examined.
posted by samsara at 1:48 PM on December 9, 2011


Response by poster: Thanks for the latest answers. I am now posting on an apparently docile and fully backed up machine instead of my phone.

If I reinstall XP (and I am seriously tempted to just say the hell with it and move on to a decent Windows 7 laptop that can also function as my home office machine), I'll look at nLite.

I checked my profiles and I am definitely on the older version that went corrupt in July, but the only other profile listed is the alternate I manually created back then. When my profile went corrupt in July, I didn't get any new profiles created by Windows. Odd: either only the bad profile was listed at login, or it brought me to the bad profile without offering any options, or I simply clicked on the bad profile at login. I honestly can't remember because I've been on multiple deadlines this week and am just catching up on sleep now. (I should log out and check -- but I'll back up again, just to be sure.)

Wow. That System event log shows mostly Information icons and a few Warnings from time to time. I have a bunch of errors, but most are DCOM ("The server {5A5AA0AA-1DEB-4683-96B0-B43301E83971} did not register with DCOM within the required timeout.") and none are about the disk.

The log only records as far back as September 23, so I can't trace what was happening in July with that first profile corruption. Most warnings are about TCP/IP connections reaching the security limit, but there are clusters re disk errors. I can't see any Application messages about Ghost when I go there through the event viewer.

Starting with a small burst on October 15, the disk messages were: "An error was detected on device \Device\Harddisk8\D during a paging operation." or "An error was detected on device \Device\Harddisk7\D during a paging operation."

There was one cluster of warnings in late September (disk 7), one cluster in mid October (disk 8), three clusters several hours apart on November 9 (disk 7), four clusters on December 6 (disk 8), and EIGHT clusters so far today, starting around 1 AM (disk 7).

Two similar clusters occurred on November 28 and 29, but these were re ftdisk: "The system failed to flush data to the transaction log. Corruption may occur." There's no indication of which disk was involved.

I'm guessing my latest profile was marked for death from that point on. However, I have two important questions:

1) Where do I look to determine which disk is 7 and which is 8? The BIOS at start up? I really don't want to restart this machine unless I'm ready to say goodbye to this install. And those seem like awfully high numbers: wouldn't the C drive be disk 0?

2) Really dumb question: if my data drive is somewhat corrupted, will my backup drive also be storing corrupted versions?
posted by maudlin at 2:43 PM on December 9, 2011


Response by poster: OK, I just looked in disk management.

C drive is disk 0 and is marked as Healthy, and my data drive (K) is disk 1, also allegedly healthy. Disks 2-6 are media drives and a USB stick (no media or USB inserted right now), Disk 7 is my external drive, also deemed healthy.

There is no disk 8. WTF? Given the sequence, I think the status messages were referring to the C drive as 7 and the K data drive as 8.
posted by maudlin at 2:50 PM on December 9, 2011


Best answer: "An error was detected on device \Device\Harddisk8\D during a paging operation."

This is one of those Windows messages it doesn't pay to try to decode too much, because Windows has helpfully truncated the actual device name (that D at the end is the first letter of the word Disk, and which disk is something you will never find out from looking at the event log). The only clue is that it's a paging operation, which might mean that there's a bad block inside one of your pagefile.sys files (which I don't believe you'll find on an external drive unless you've done something clever), which might have consequences like those you'd get from intermittently bad RAM. But you can also cause paging operations by memory-mapping other files, so that's not really definitive either.

When I see this kind of issue on a customer machine, I use command-line-based tools built into the Trinity Rescue Kit to find the faulty disk blocks and overwrite them, which either fixes them or causes the disk drive to replace them with spare blocks; then I use Windows chkdsk to deal with any filesystem problems that this surgery has caused. Most of the time, this works well.

If you're comfortable working in a bash shell, let me know and I'll document the scan+fix procedure.
posted by flabdablet at 6:26 PM on December 9, 2011


Response by poster: This is one of those Windows messages it doesn't pay to try to decode too much, because Windows has helpfully truncated the actual device name

How utterly .... brilliant. Given that my C drive is an old drive and the K data drive is new, I'm inclined to think the rapidly increasing number of warnings recently points to the C drive that is original to the machine I purchased in 2005. (It also frags up pretty rapidly: it was installed in December 2010 and needed to be defragged in March.)

I had planned on just running chkdsk fully this time, allowing it to fix any errors, but TRK looks interesting. If you can point me to the correct procedure, that would be great -- thanks!
posted by maudlin at 9:52 AM on December 10, 2011


Best answer: If you're going to run chkdsk, you generally do want it to fix filesystem errors. Letting it scan for bad clusters is a waste of time though, because all it will do is add any bad clusters to the NTFS bad clusters file and avoid using them thereafter - it won't actually make the drive spare them out, and it will lose you a whole 4K cluster every time a single 512-byte sector goes bad. It's also usually harder to clone an NTFS partition with entries in the bad clusters list.
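To put numbers on that 4K-per-sector cost, here is a minimal sketch of the arithmetic. It assumes 512-byte sectors and 4096-byte clusters (eight sectors per cluster); sector_to_cluster is a made-up helper name, and the partition start of sector 63 in the usage comment is only an example of a common XP-era layout, not something read from this machine.

```shell
# Sketch: which 4 KiB NTFS cluster a given 512-byte sector falls in.
# $1 = absolute sector number on the disk
# $2 = first sector of the NTFS partition (example value: 63)
# Assumes 8 sectors per cluster (4096 / 512); other cluster sizes differ.
sector_to_cluster() {
    echo $(( ($1 - $2) / 8 ))
}

# Example (values are illustrative only):
#   sector_to_cluster 1000 63
```

Every sector from the start of that cluster to its end maps to the same cluster number, so marking the cluster bad retires all eight sectors even if only one of them is unreadable.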

The speed at which an NTFS filesystem becomes fragmented is entirely down to usage patterns and generally says nothing about drive quality - unless it's happening because NTFS frequently finds a need to straddle clusters its own chkdsk tool has marked as bad.

OK, so: how to use the tools that come with TRK to find and fix bad disk blocks.

Obligatory warning: do not do this without a backup of everything on the disk you're trying to fix. Small slips can cause major or even total data loss.

First thing is to find the device name of the disk you want to fix. Fastest way to do that is using
fdisk -l
(that's "minus lowercase ell") to display a list of disk devices and their partitioning. With current versions of TRK run on a typical Windows box with a single internal hard disk, the internal hard disk will usually be /dev/sda, and that's what I'll assume for the rest of this.

Next: scan the disk for bad blocks, by copying all blocks to the null device:
ddrescue /dev/sda /dev/null sda.log
Pretty much as soon as that's got started, interrupt it with Ctrl-C and do the following:
modprobe raw
raw /dev/raw/raw1 /dev/sda
ddrescue --complete-only --max-retries=1 /dev/raw/raw1 /dev/null sda.log
There are a few things going on here.

We're now using a raw block device mapped to /dev/sda, rather than /dev/sda itself. This means that ddrescue gets to read directly from the underlying disk device, rather than getting there via the kernel's disk cache. We want that, because the disk cache always does its disk reads in chunks of 4KiB, while we're trying to locate single bad 512-byte sectors.

We've restarted ddrescue using the same log file (sda.log, which incidentally resides in TRK's default home directory /root, which is actually in RAM) as the interrupted copy, so ddrescue will pick up reading the raw device right where it left off reading the cached one.

The --complete-only option tells ddrescue to trust the log file to tell it what size the device is, rather than interrogating the device itself for this information. Linux raw block devices won't tell you how big they are.

Recent versions of ddrescue have a --direct option that effectively does the same thing as the raw-device dance. I do it the long way round so I don't have to care which version I'm using.

This scan will run for quite a long time (it's reading every single disk block) and the screen may go partially blank; just tap the Ctrl key to make it come back, if you care.

When it's finished, sda.log will contain a listing of all the bad blocks that ddrescue encountered. If you do
cat sda.log
and see any lines that end with -, your disk has bad blocks.
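If eyeballing the log is tedious, something like the following could summarise it. This is only a sketch: it assumes the usual ddrescue log layout (comment lines starting with #, one status line, then rows of position, size and status in hex, with byte offsets), and list_bad_areas is a made-up helper name rather than anything that ships with TRK.

```shell
# Sketch: summarise the bad areas recorded in a ddrescue log.
# Assumed layout: '#' comment lines, one status line, then rows of
# "position size status" in hex bytes, where status '-' marks a bad area.
list_bad_areas() {
    # $1 = path to the log file, e.g. sda.log
    grep -v '^#' "$1" | {
        total=0
        while read -r pos size status rest; do
            # The status line has no '-' in its third field, so it is skipped too.
            [ "$status" = "-" ] || continue
            echo "bad area: sector $(($pos / 512)), $(($size / 512)) sector(s)"
            total=$((total + $size))
        done
        echo "total bad bytes: $total"
    }
}

# Usage:
#   list_bad_areas sda.log
```

The sector numbers it prints assume 512-byte sectors, which matches the drive discussed here but would need adjusting for other sector sizes.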

If you happen to have on hand an unused external hard drive at least as big as the one you're scanning, and you've found its device name (e.g. /dev/sdc) with fdisk -l, and you use that name instead of /dev/null in the command lines above, then you will end up with a clone of your internal drive as well as a bad blocks log. If your original drive is six years old, this is actually the right thing to do.

If ddrescue did indeed encounter bad blocks, here's how to fix them.

First, make a backup copy of sda.log:
cp sda.log sda.test.log
Now "rescue" the zero device /dev/zero onto the one with the bad blocks using the same log file, so that the only blocks ddrescue will overwrite are the ones already identified as bad:
ddrescue --complete-only --max-retries=1 /dev/zero /dev/raw/raw1 sda.log
I will generally prepare that command line by using up-arrow to restore the previous ddrescue command, then edit the device names. Double and triple-check it before pressing Enter. If any options are misspelled or you've used the wrong log file name or left out spaces or some damn thing, now is the time when ddrescue will scribble zeroes all over the drive I told you to back up but you didn't.
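One way to make that double-checking a little more mechanical is a pre-flight step that looks at the log before anything gets written. The sketch below is an assumption-laden illustration, not part of TRK: preflight_zero_fill is a hypothetical helper, it never runs ddrescue itself, and it only prints the command from above so it can be read over before being typed or pasted.

```shell
# Sketch of a pre-flight check before the zero-fill step.
# It does NOT run ddrescue; it only prints the command for review.
# preflight_zero_fill is a hypothetical helper name.
preflight_zero_fill() {
    # $1 = raw device that would be written to, $2 = ddrescue log file
    dev=$1
    log=$2
    if [ ! -s "$log" ]; then
        echo "refusing: log file '$log' is missing or empty" >&2
        return 1
    fi
    # Count rows whose third field is '-', i.e. areas logged as bad.
    bad=$(awk '!/^#/ && $3 == "-" { n++ } END { print n + 0 }' "$log")
    if [ "$bad" -eq 0 ]; then
        echo "refusing: '$log' lists no bad areas, nothing to overwrite" >&2
        return 1
    fi
    echo "$bad bad area(s) logged; command to review before running:"
    echo "ddrescue --complete-only --max-retries=1 /dev/zero $dev $log"
}

# Usage:
#   preflight_zero_fill /dev/raw/raw1 sda.log
```

It catches only the wrong-or-empty-log-file kind of slip; a mistyped device name still has to be caught by eye.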

If all is well, that command will complete very quickly, and sda.log will not contain any trailing - lines after it's done.

Now check that all the blocks that were previously unreadable have become readable again:
cp sda.test.log sda.log
ddrescue --complete-only --max-retries=1 /dev/raw/raw1 /dev/null sda.log
The ddrescue command here is the same as that originally used to scan the drive, but this time it will only be reading blocks marked bad in the log file, all of which should by now have been overwritten and come good; so this command should also run very quickly, and sda.log should contain no - error lines afterward.

Now, if this is a Windows machine, force Windows to run chkdsk at startup:
ntfsfix /dev/sda1
reboot
and you should be good to go.

TRK also has the smartctl command, which is useful for interrogating disk SMART logs. It's possible to use the block numbers associated with SMART-reported errors to avoid lengthy disk scans, but I generally prefer to do the scan, as it will pick up errors in parts of the disk that the OS has not yet encountered.
posted by flabdablet at 7:57 PM on December 10, 2011


Response by poster: Wow. Thanks, flabdablet. That's extremely helpful, impressive and absolutely terrifying all at once.

I think that in order to do this properly, I'll have to be ready to lose my install if I make a single crucial mistake, but I'm overwhelmed with work up until Christmas.

I'm going to use a full chkdsk procedure -- including fixes -- immediately, keep backing up my data, try to get a laptop ASAP to continue working on as my main machine, and keep an eye on Event Viewer to see if it's still throwing warnings and errors. Once I have a safe place to work away from the machine, I'll be free to try something detailed and risk blowing up my XP install with a lot less stress.
posted by maudlin at 5:28 PM on December 11, 2011


Just something to keep in mind with this drive. Bad blocks are not normally a software problem, but are actually hardware related. Most of the time when you see bad blocks reported, it means that there are spots on the HD that are failing to hold magnetic writes. Unfortunately, these areas cannot really be "fixed" but are rather marked to skip when the drive is in operation. The potential issue is these bad "areas" will grow larger, especially on older drives, so it's best to stop using this one and get another drive as soon as possible.
posted by samsara at 5:09 AM on December 12, 2011


Well, yes and no. I have frequently seen bad blocks completely fixed by the above method (i.e. SMART reports no reallocated sectors after the zero rewrite) and I generally find that provided a drive has only needed this kind of attention once or twice it will keep working just fine for years. Seems to me that a small polished surface tasked with storing ten or twenty trillion bits should be expected to show a few microscopic defects.

Sure, if you find a SMART log showing more than a few tens of reallocated sectors + sectors pending reallocation + uncorrectable sectors, it's most likely not a happy drive. But small numbers of bad blocks don't necessarily mean anything's even slightly wrong with the drive; they might just be evidence that it lost power and had to abandon a sector write before laying down the error correction code at the end.
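For anyone who wants that sum without reading the table by hand, here is a rough sketch. It assumes the usual ten-column attribute table that smartctl -A prints, with the attribute ID in the first column and the raw value in the tenth, and it assumes the three counts live under IDs 5, 197 and 198 as they do on most drives; sum_bad_sector_counts is a made-up name.

```shell
# Sketch: add up the raw values of the three SMART attributes above.
# Reads "smartctl -A" style output on stdin. Assumes the usual table
# layout: attribute ID in column 1, RAW_VALUE in column 10.
#   5   = reallocated sectors
#   197 = sectors pending reallocation
#   198 = offline uncorrectable sectors
sum_bad_sector_counts() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { sum += $10 }
         END { print sum + 0 }'
}

# Usage (device name is an example):
#   smartctl -A /dev/sda | sum_bad_sector_counts
```

Some drives report raw values in vendor-specific formats, so treat the total as a rough indicator rather than an exact count.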

Anyway, as long as you're backing up the way your mother taught you, gambling on a drive's continued success is pretty low-risk.

I've just remembered another thing: Windows will often react to failed disk reads by backing off the DMA speed on the IDE channel allocated to that drive, usually all the way down to PIO mode, and then leaving it set that way forever. This makes the computer run grindingly slowly all the time, even after the bad block has been fixed or reallocated or added to the NTFS bad clusters list and is no longer causing trouble, and it would be very easy to draw the mistaken conclusion that one bad block = a ruined drive.

If you use the Device Manager to check the properties of the IDE channel connected to the primary hard disk, and it's set for "DMA if available" but currently running as "PIO Only", this is what's happened. You can make it zippy again by deleting that IDE channel and then rebooting. Windows will automatically reinstall it with the highest available DMA speed.
posted by flabdablet at 5:42 AM on December 12, 2011


True, just erring on the side of caution. Bad sectors aren't necessarily the harbingers of drive death...but are a pain when physical defects...I often compare them to little magnetic rabbits. But I agree 100% that's no reason to give up on hope right away. Oh and good call flabdablet on the DMA speed! Another factor the OP may want to consider though, being that it's an old drive...would be to not put too much trust in a drive past five years...this one is at almost seven. Regular backups will definitely be key to mitigating risk.

That comes with a word of caution too: if the drive *is* slowly failing...you may not notice it right away. Unlike a sudden head failure, a drive with physical bad sectors simply degrades slowly. So be sure if you are backing up regularly, to also do differential compares on the data sets to see if file checksums are changing without your knowledge. Backing up a corrupted file only to restore a corrupted file later on is no fun (or in this case, accidentally backing up a previous version of a file). Luckily technology has improved enough that the system will pick up on most anomalies with sector freckling when encountered...but there is still a possible risk some corruption patterns may go unnoticed...so continue to keep an eye on those logs and periodically check the disk to see if anything returns. The biggest pain may end up being with delayed write failures as you're trying to save a project you've spent considerable recent time on...eg. where backups wouldn't apply as it is a new save....if the work you're doing on this drive is hard to replicate, I'd consider that into the gamble, and possibly save it to a newer drive first before copying it to the older one.
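A differential compare of that sort can be as simple as saving a checksum list and checking against it later. The sketch below assumes a Unix-style shell with GNU coreutils (sha256sum) and find, for example from a Linux live CD or another machine that can read the drive; make_manifest and verify_against are made-up helper names.

```shell
# Sketch: fingerprint a folder, then later check it (or a backup of it)
# against that fingerprint. Assumes GNU coreutils' sha256sum is available.
make_manifest() {
    # $1 = folder to fingerprint; prints "checksum  ./relative/path" rows
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

verify_against() {
    # $1 = folder to check, $2 = absolute path of a saved manifest.
    # Prints one "path: FAILED" row per file whose content has changed
    # and returns non-zero if any file differs.
    ( cd "$1" && sha256sum --quiet -c "$2" )
}

# Usage (paths are examples):
#   make_manifest /mnt/data > /root/data.manifest
#   verify_against /mnt/backup /root/data.manifest
```

A FAILED row for a file you know you edited is expected; one for a file you haven't touched since the manifest was made is the warning sign.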

Best of luck though! While bad sectors are annoying, things are still much MUCH better than the days of yore (1980s) with 10 MB hard drives. Back then, the manufacturers used to affix stickers to the outside of the drives (which were in themselves typically half the size of car batteries) listing all the defective sectors. Yes, it was actually common practice to ship disk drives with known defects! (much like acceptable dead pixels on LCDs more recently...which is a practice on its way out as well).
posted by samsara at 8:41 AM on December 12, 2011


Yes, it was actually common practice to ship disk drives with known defects!

That hasn't changed, by the way. Only difference is that nowadays the known defects are automatically spared-out as they're detected during factory formatting, using exactly the same procedure that deals with field-grown defects, instead of being listed on the drive label.

being that it's an old drive...would be to not put too much trust in a drive past five years...this one is at almost seven.

Another good reason for replacing it is that if you were to use ddrescue to clone the old drive onto a (much bigger) new one, the entire Windows partition would end up on the new drive's outer tracks. I'd expect the result to feel much more responsive.

This technique is called short stroking, and there are various over-complicated ways of achieving it that work no better than simply making a drive's first partition substantially smaller than the drive as a whole. This happens naturally when cloning a small drive to a bigger one with ddrescue. I'd also expect that the only drive you could possibly buy to replace one built six years ago would indeed be much bigger.

Ideally, you'd then use the resulting spare space only as somewhere to put backups of something from some other drive (maybe the existing drive K:) so that there was no requirement for frequent access to that space in normal operation. Things would then stay zippy.
posted by flabdablet at 3:59 PM on December 12, 2011


Response by poster: I was briefly hopeful that a full run of chkdsk with fixes did some good, as I saw no disk related warnings in the System logs for several hours after restarting last night, but the warnings are coming fast and furious again (10 clusters since 3 AM). I have my eye on an Asus laptop downtown, but I have to get through at least one more day of teaching and development first. Meanwhile, I keep backing up my data to my external drive (not quite as Mom taught me, as apparently I'm supposed to go fuck myself. Mom always did have a potty mouth.) I guess I'm picking up another drive downtown, too.

As soon as I have time to breathe (2012?), I really have to try ddrescue to clone my sputtering old drive. The prospect of a truly zippy XP install delights me. Thanks again, guys!
posted by maudlin at 4:14 PM on December 12, 2011


10 clusters since 3 AM

If I owned that drive, I'd shut it down and ddrescue clone it now.
posted by flabdablet at 4:20 PM on December 12, 2011


Response by poster: So this really is as bad as I thought?

If I had the time to do my first ddrescue, with all the risks and associated stress for a newbie like me, I would. But I'm under deadlines from multiple clients and have to finish three deliverables by tomorrow, then I'm teaching all day.

All my data is on a second physical drive (and that drive is being backed up to an external drive every hour). If the C drive blows up before I can use ddrescue, that will suck, but:

a) I have a slower laptop in my office right now with all my apps on hand. I can shift to that immediately if and when needed, but not any earlier, because it is really, really slow.

b) I'll have a Ghost image of my December 2010 clean install of XP ready to put on the new drive I'm buying tomorrow. I'll still have to add a few more apps, but that won't take long (all install files and keys are saved externally).

I'm walking out of the house at 4:30 PM Tuesday and buying a new laptop, I swear, so I have a safe and reasonably fast place to continue working until I can salvage this XP install.
posted by maudlin at 4:41 PM on December 12, 2011


Here's a recipe for using TRK to clone a failing drive onto a new one, extracting as many good sectors as possible. Use it when you're ready.

First, identify the drives. This is easiest if the only hard drives connected are the old one and the previously unused new one, in which case
fdisk -l
will list one smallish drive with existing partitioning (I'll call this /dev/sda in what follows) and one big one without a valid partition table (which I'll call /dev/sdb).

Cloning is next. The ddrescue baked into the current version of TRK does have the --direct option and a kernel that supports it, so use that instead of all the raw-device foolery above:
ddrescue --direct --max-retries=10 /dev/sda /dev/sdb sda.log
About the only way you can cause death and destruction with this is by getting the device names wrong (the device to read from must go in the command line before the device to write to).

Now mark the first partition on the clone (I'm assuming there will only be one partition) to make Windows run chkdsk against it on startup, then shut down:
ntfsfix /dev/sdb1
halt
If you now physically remove the original drive from the computer, and put the clone in its place, it should just boot and go. Windows will run a chkdsk and (hopefully) fix any filesystem errors caused by sectors that couldn't be copied from the original disk, and then reboot; not long after first login, it will prompt you about newly found hardware that needs a reboot to work properly - that's the new hard drive and in fact it will work fine without that reboot.

Visit the Device Manager (right-click My Computer, Manage, Device Manager), check all your IDE channels, and delete any that are stuck in PIO only mode; reboot; you're done.
posted by flabdablet at 6:15 PM on December 12, 2011


Response by poster: Thanks for the recipe!
posted by maudlin at 10:06 PM on December 12, 2011


Welcome.

Do let me know how it goes if you use it.
posted by flabdablet at 1:56 AM on December 13, 2011

