Corrupt Data in Docs - Oh No
January 28, 2009 1:06 PM

We have some file corruption on our network drive and need to search through 1 TB worth of data. How can I do that? Example inside.

The corrupted files are .doc, .pdf, and some .wpd. Most of the files open fine but they are filled with squares. I thought I could use Agent Ransack to find files that contain a column of squares, but when I cut and paste the squares into Agent Ransack the search text is blank and it will not search.

See pic for example : here

The only common denominator is that the created date is well after the last modified date on the file.

The file system is traditional NTFS (Windows).

Any help to find contents of empty files that are random in size, time, date - would be much appreciated!
posted by bleucube to Computers & Internet (8 answers total)
Response by poster: Um Here:
posted by bleucube at 1:07 PM on January 28, 2009

Best answer: well that ain't good. time to bring out the python: use os.walk and os.stat to pull st_ctime and st_mtime (on Windows, st_ctime is the creation time), and you can compare and contrast. you can also open up the first few K of each file and look to see if those corrupted hex bytes i see from your screenshot are in there. if you need more help, go ahead and mefimail me.
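
A minimal sketch of that walk, assuming Python 3 run on the Windows box itself, where st_ctime has historically reported the creation time (on Unix it is the metadata change time instead). The names find_suspect_files and slack_seconds are made up for illustration, and the extension list just mirrors the file types mentioned in the question.

```python
import os

def find_suspect_files(root, extensions=(".doc", ".pdf", ".wpd"), slack_seconds=0):
    """Return (path, ctime, mtime) for files whose ctime is later than mtime.

    slack_seconds is a hypothetical knob: only report files where the gap
    between ctime and mtime is larger than this many seconds.
    """
    suspects = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only look at the document types named in the question.
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # Unreadable or vanished file: skip it rather than abort the scan.
                continue
            if st.st_ctime - st.st_mtime > slack_seconds:
                suspects.append((path, st.st_ctime, st.st_mtime))
    return suspects

# Example call (the share path is a placeholder, not a real location):
# for path, ctime, mtime in find_suspect_files(r"\\server\share", slack_seconds=86400):
#     print(path)
```

Treat the result as a list of candidates, not proof of corruption: a file that was simply copied on Windows usually also ends up with a created date later than its modified date.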

i feel your pain though, we had a 10TB partition made from two 5TB SAN mounts that were LVM'ed and RAID 0'ed. The shitty Compellent SAN decided to rebalance, fuck up the UUIDs, and annihilate the data (all $15k of it).
posted by Mach5 at 1:14 PM on January 28, 2009

Response by poster: Getting a little further with Agent Ransack. Using the expression engine I can "find" a corrupted document on:

Line Begins with "Don't Know"
Followed by "Any Character" that occurs "Zero or Many times"
Line Ends with "Don't know"

Of course this finds everything, but at least it recognizes a character within the document. Now I just need to know what character the app perceives the square as. If I open the .doc in Notepad it lists NULL.
posted by bleucube at 1:27 PM on January 28, 2009

don't bother with notepad, get HxD and take a look at the raw file. it's probably not going to be an ASCII character.
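
If installing a hex editor on the server is a hassle, a rough stand-in for that raw view can be sketched in Python. This is only an illustration: hexdump and dump_file_head are hypothetical helper names, and 256 bytes is an arbitrary amount to read.

```python
def hexdump(data, width=16):
    """Format bytes as offset, hex values, and printable ASCII (dots otherwise)."""
    pad = width * 3 - 1  # width of a full hex column, so short rows still line up
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(lines)

def dump_file_head(path, count=256):
    """Read the first `count` bytes of a file in binary mode and hex-dump them."""
    with open(path, "rb") as handle:
        return hexdump(handle.read(count))

# Example call (placeholder file name):
# print(dump_file_head("suspect.doc"))
```

Comparing the dump of a known-good file against a corrupted one should show which raw byte values sit behind the squares.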
posted by Mach5 at 1:54 PM on January 28, 2009

Those boxes are just what Word displays when it runs across a non-displayable character. So what you have is a sequence of non-displayable characters, not one character displayed many times. You'll never be able to copy+paste search for these because each sequence is most likely unique.

In case that was a little obtuse, consider two fictitious documents containing the following byte sequences:



Word will display both of those as a series of six boxes, but a search for {0x05,0x06,0x07} will only turn up the first one.
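
The same point can be shown in a few lines of Python. The two byte strings below are invented stand-ins, not the sequences from any real file, and count_unprintable is a hypothetical helper that only approximates "would probably show up as a box" by counting ASCII control bytes other than tab, line feed, and carriage return.

```python
# Two made-up "documents", each six non-displayable bytes long.
doc_a = bytes([0x05, 0x06, 0x07, 0x05, 0x06, 0x07])
doc_b = bytes([0x01, 0x02, 0x03, 0x04, 0x0B, 0x0E])

# The byte pattern someone might try to search for.
pattern = bytes([0x05, 0x06, 0x07])

def count_unprintable(data):
    """Count control bytes, ignoring tab (0x09), LF (0x0A), and CR (0x0D)."""
    return sum(1 for b in data if b < 0x20 and b not in (0x09, 0x0A, 0x0D))

# Both look equally "full of boxes" by this measure...
print(count_unprintable(doc_a), count_unprintable(doc_b))

# ...but a literal search for the pattern only matches the first one.
print(pattern in doc_a, pattern in doc_b)
```

That is why a search keyed on one exact sequence misses most of the damaged files, while a count of non-displayable bytes (or the date comparison) does not depend on which bytes they happen to be.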

IMHO, the best bet is to get python or perl and do the mtime vs. ctime search described above.
posted by sbutler at 2:24 PM on January 28, 2009

If the dates are wrong then the whole disk is falling apart and the filesystem is corrupt. Take an image of the disk and use something like SpinRite on it. Did you check SMART or do a disk scan yet?
posted by damn dirty ape at 9:25 PM on January 28, 2009

Response by poster: Thanks for the information. sbutler is right, the hex is different. So on to python. Mach5 - I'll use your suggestion and give it a whirl. damn dirty ape - the server is newish and the data was migrated about 5 months ago. According to our backups (prior to migration) it looks like the data corruption took place on the old server... still checking on this though.
posted by bleucube at 3:02 AM on January 29, 2009

Response by poster: Mach5, I sent you a mefimail. I'm at a loss - first time python user. Ha.
posted by bleucube at 3:49 AM on January 29, 2009
