Finding files consisting of only NUL characters in Windows
December 6, 2016 6:31 PM
So I've got a folder for my music projects going back to 2000. It's a mess, and I back it up every so often. I had an HD crash a few years ago, and I have no idea what I used to restore it, but I remember at the time I felt lucky I got back everything. Except I don't think I did. Whatever I used ended up writing files with the proper byte size consisting entirely of NUL characters.
There seem to be a large number of them. Maybe about a third, going by what I see when comparing different backups with Anti-twin.
Does anyone have any advice for checking through a large nested directory in Windows for a run of 4 or 5 NUL characters in a row in each file, so I can generate a list (with paths) that I could use to check them?
I'm thinking a list generated from the command shell would be ideal so I could maybe mass delete them with it once I've gotten a handle on what data's got the rot. I'm open to other ideas. I've known about this issue for a couple of years, but finally have a computer fast enough to deal with it once and for all.
Response by poster: I'm trying ^\0+$ but I keep crashing grepWin.
posted by Catblack at 7:52 PM on December 6, 2016
I'd probably go for Perl or some other scripting language (even on windows somehow). Read each character from the file and exit 1 if the character is not NULL. If you reach the end of input, exit 0.
    $ echo foo > foo
    $ dd if=/dev/zero of=zero count=100
    $ perl -e 'while(defined($c=getc)){ord $c && exit 1}exit 0' < zero; echo $?
    0
    $ perl -e 'while(defined($c=getc)){ord $c && exit 1}exit 0' < foo; echo $?
    1
This would make it easy to find files with only NULLs. Sadly I don't know enough windows batch file or powershell to make it useful.
posted by zengargoyle at 8:07 PM on December 6, 2016
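The same read-until-first-non-NUL idea can be sketched in Python 3 for anyone without Perl on Windows. This is a minimal sketch, not the commenter's code: the function name is_all_nul and the 64 KiB chunk size are illustrative assumptions, and treating empty files as "not all NUL" is a choice made here, not something stated in the thread.

```python
def is_all_nul(path, chunk_size=65536):
    """Return True if the file is non-empty and every byte in it is NUL (0x00)."""
    seen_data = False
    # 'rb' reads raw bytes, so nothing is decoded or translated on Windows.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file: all bytes seen so far were NUL.
                # Empty files return False (assumption: they are not "rotted").
                return seen_data
            seen_data = True
            # Stripping NULs from both ends leaves something only if
            # the chunk contains at least one non-NUL byte.
            if chunk.strip(b"\x00"):
                return False
```

Because it returns as soon as a non-NUL byte turns up, healthy files are usually rejected after the first chunk, and only the all-NUL files are read to the end.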
Oh, for grepWin or whatnot, search for the equivalent of 'a single not \0' and invert the results. That way it will fail as soon as you find a 'not \0' character.
Something like:
[^\0]
posted by zengargoyle at 8:10 PM on December 6, 2016 [1 favorite]
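The "search for one non-NUL and invert" trick carries over to Python's re module with a bytes pattern. A small sketch, assuming the data is already in memory; the names NON_NUL and only_nuls are made up for illustration.

```python
import re

# Matches any single byte that is not NUL (newlines included).
NON_NUL = re.compile(rb"[^\x00]")

def only_nuls(data):
    """True if data is non-empty and the 'not NUL' pattern finds nothing."""
    # search() stops at the first non-NUL byte, so healthy data fails fast.
    return len(data) > 0 and NON_NUL.search(data) is None
```

This avoids a pattern like ^\0+$, which has to match the entire input before it can succeed.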
In Python, os.walk makes it relatively easy to recurse through directories. Here's an example using Python 2.7 that reads the first 8096 bytes of any files underneath the directory that you specify as the first argument, and searches them for a sequence of 5 null bytes. If it finds that sequence, it prints out the path to the file.
    import sys, os
    basedir = sys.argv[1]
    # Walk every file underneath basedir
    for root, dirs, files in os.walk(basedir):
        for fname in files:
            fpath = os.path.join(root, fname)
            # 'rb' = binary mode, so Windows doesn't translate line endings or stop at Ctrl-Z
            with open(fpath, 'rb') as f:
                if f.read(8096).find('\0\0\0\0\0') != -1:
                    print fpath
Be warned that lots of files will have sequences of five null bytes. It might be more useful to check whether the first 5 bytes are null, since that is much less common:
    import sys, os
    basedir = sys.argv[1]
    for root, dirs, files in os.walk(basedir):
        for fname in files:
            fpath = os.path.join(root, fname)
            with open(fpath, 'rb') as f:
                # Only the first 5 bytes are read and compared
                if f.read(5) == '\0\0\0\0\0':
                    print fpath
If you use Python 3 instead of Python 2, there may be some weirdness because of Unicode strings. Someone with Python 3 experience may be able to comment on that.
Also note that if you want to hardcode the path to the directory in your script, you'll want to use raw strings so that you don't have to escape all your backslashes. E.g., use this:
    basedir = r'C:\Users\Me'
...so that you don't have to do this:
    basedir = 'C:\\Users\\Me'
posted by clawsoon at 12:48 PM on December 7, 2016
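On the Python 3 question: opening files in binary mode ('rb') and comparing against a bytes literal avoids the Unicode issue, because nothing gets decoded. Here is a minimal Python 3 sketch of the first-five-bytes check. The function name files_starting_with_nuls and the count parameter are illustrative, and skipping unreadable files is an assumption about what you'd want.

```python
import os

def files_starting_with_nuls(basedir, count=5):
    """Yield paths under basedir whose first `count` bytes are all NUL."""
    marker = b"\x00" * count
    for root, dirs, files in os.walk(basedir):
        for fname in files:
            fpath = os.path.join(root, fname)
            try:
                # Binary mode returns bytes, so compare against a bytes marker.
                with open(fpath, "rb") as f:
                    head = f.read(count)
            except OSError:
                # Assumption: locked or unreadable files are simply skipped.
                continue
            if head == marker:
                yield fpath
```

Something like `for p in files_starting_with_nuls(r'G:\Folder\With\Bad\Files'): print(p)` would then print the suspects, one path per line.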
Best answer: In case it's not clear, after you install Python 2.7 and save whichever version of the script you'd like, you launch it like so:
C:\Python27\python.exe "C:\Path\To\Where\You\Saved\TheScript.py" "G:\Folder\With\Bad\Files"
If you wanted to make your life a little easier, you could change the script a little bit and output it to a batch file that you can edit and save:
    import sys, os
    basedir = sys.argv[1]
    for root, dirs, files in os.walk(basedir):
        for fname in files:
            fpath = os.path.join(root, fname)
            with open(fpath, 'rb') as f:
                if f.read(5) == '\0\0\0\0\0':
                    # Emit a batch-file line instead of the bare path
                    print 'del "{}"'.format(fpath)
...and then:
    C:\Python27\python.exe "C:\Path\To\Where\You\Saved\TheScript.py" "G:\Folder\With\Bad\Files" > delnull.bat
If any of your filenames have double quotes in them, you might instead want to go with a pure-Python version so that you don't have to worry about quoting.
    import sys, os
    basedir = sys.argv[1]
    for root, dirs, files in os.walk(basedir):
        for fname in files:
            fpath = os.path.join(root, fname)
            with open(fpath, 'rb') as f:
                is_null = f.read(5) == '\0\0\0\0\0'
            # The file is closed by now; Windows won't delete a file that is still open
            if is_null:
                do_it = raw_input('Delete "{}"? '.format(fpath))
                if do_it == 'y':
                    os.remove(fpath)
posted by clawsoon at 1:04 PM on December 7, 2016
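For the list-first, delete-later workflow described in the question, the scan and the delete can be split into two steps with a plain text file in between. This is a rough Python 3 sketch under a few assumptions: the names all_nul, write_list and delete_from_list are invented here, a file only counts if every byte is NUL (not just the first five), and the list file should live outside the folder being scanned.

```python
import os

def all_nul(path):
    """True if the file is non-empty and contains nothing but NUL bytes."""
    seen = False
    with open(path, "rb") as f:
        # Read 64 KiB at a time until read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(65536), b""):
            if chunk.strip(b"\x00"):
                return False  # found a non-NUL byte
            seen = True
    return seen

def write_list(basedir, list_path):
    """Write one path per line for each all-NUL file; return how many were found."""
    count = 0
    # Assumption: list_path is outside basedir so it isn't scanned itself.
    with open(list_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(basedir):
            for fname in files:
                fpath = os.path.join(root, fname)
                try:
                    if all_nul(fpath):
                        out.write(fpath + "\n")
                        count += 1
                except OSError:
                    continue  # unreadable file: leave it alone
    return count

def delete_from_list(list_path):
    """Delete every file still named in the (possibly hand-edited) list."""
    removed = 0
    with open(list_path, encoding="utf-8") as f:
        for line in f:
            path = line.rstrip("\n")
            if path and os.path.isfile(path):
                os.remove(path)
                removed += 1
    return removed
```

The idea is to run write_list once, open the list in a text editor, remove any lines for files you want to keep or investigate, and only then call delete_from_list on the edited file.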
posted by djb at 6:48 PM on December 6, 2016