Compare files on a Mac
November 6, 2006 8:01 AM
I have been building up a rather large PDF library of scientific literature. The problem is that I am sure some of these are copies that just have different names. Is there a (free?) application (for Mac) that will find the duplicates for me?
Lifehacker just wrote about Yep, a pdf library/organizer app.
I haven't used it, but it's free so you might as well try it.
posted by zazerr at 8:12 AM on November 6, 2006
File Buddy isn't free, but it did the best job for me when I was doing exactly this task. It will let you exclude certain differences like modification date, or compare only the data fork size, etc. And it's really fast: I scanned 27 GB of MP3s and found several thousand dupes in just a few seconds.
It has a free trial if you want to give it a try. Or if you're only planning on cleaning things up this once maybe it will take care of the problem entirely for you.
(I hope this doesn't sound too much like I'm shilling for the program...it's just a utility I like, I swear!)
posted by bcwinters at 8:51 AM on November 6, 2006
Oh, and I should note that I was using one of the File Buddy 9 betas for this--I believe the duplicate processing feature is one of the things that was rewritten/overhauled in 9.
posted by bcwinters at 8:53 AM on November 6, 2006
JakeLL's idea of comparing by size is the easy (and probably most time-effective) way of doing it.
The more accurate, but much more involved way, would be to calculate MD5 checksums for each file. (Using the "md5" command-line tool.) I can't find any nice programs to do it all automatically (besides this one, which only compares two files at a time), so you'd probably have to write a basic shell script to do it.
I find it hard to believe that no one ever wrote a nice little graphical utility that simply generated MD5s for files and compared duplicates. (From skimming the FileBuddy site, it doesn't seem to do it.) I don't own a Mac, or else I would.
posted by fogster at 9:45 AM on November 6, 2006
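For anyone who would rather script this than hunt for a utility, here is a rough sketch of the size-then-checksum idea in Python instead of shell. The names `md5_of` and `find_duplicates` and the default `.pdf` filter are made up for illustration and do not come from any tool mentioned in this thread. Files are grouped by size first, and only files that share a size are actually hashed.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, suffix=".pdf"):
    """Return sorted lists of paths under root with identical contents."""
    # Cheap pass: bucket candidate files by size.
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(suffix):
                path = os.path.join(folder, name)
                by_size[os.path.getsize(path)].append(path)

    # Expensive pass: hash only files that share a size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Calling `find_duplicates(".")` from the folder that holds the PDFs returns one list per set of identical files. The sketch assumes every file it finds is readable; a broken symlink or a permissions problem would raise an error.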
The reason there isn't a utility to do this is that unix shell scripting can do it for you.
Open up a terminal, and cd to the directory where you keep your pdfs. Or just stay in your home directory and search for all pdfs in there. Then type:
find . -name "*.pdf" | xargs md5 | awk '{print $4, $2}' | sort
This should compute the md5 "checksum" for every pdf under the directory you ran it from (your home directory, if you stayed there). The 'awk' part will reverse the checksum and the file name, so that the 'sort' will sort by checksum. In the output of this command, any duplicate files should be listed next to each other.
I ran this on Mac OS X 10.4.8. If this doesn't work for you, I would try leaving off the "|awk ..." part of the command, and copy pasting the output into excel or something like that, so that you can sort it there.
posted by cotterpin at 10:49 AM on November 6, 2006
Best answer: Easier with Perl. Also, you need -print0 in case of spaces in the filenames:
find . -name "*.pdf" -print0 | xargs -0 md5 | perl -lane '$c = pop @F; pop @F; if ( $cs{$c} ) { print "@F is identical to $cs{$c}" } else { $cs{$c} = join " ", @F }'
posted by nicwolff at 11:20 AM on November 6, 2006
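If Perl isn't your thing, the same "remember the first file seen for each checksum" logic can be sketched in Python. This assumes the input lines look like what the Mac `md5` tool prints, i.e. `MD5 (filename) = checksum`; the function name `report_duplicates` is hypothetical. Matching the whole line with a regular expression, rather than splitting on whitespace, keeps file names with spaces or parentheses in one piece.

```python
import re

# Matches the macOS/BSD `md5` output format: MD5 (some file.pdf) = <32 hex digits>
MD5_LINE = re.compile(r"^MD5 \((?P<name>.*)\) = (?P<digest>[0-9a-f]{32})$")


def report_duplicates(lines):
    """Return 'X is identical to Y' messages for repeated checksums."""
    first_seen = {}  # checksum -> first file name seen with that checksum
    messages = []
    for line in lines:
        match = MD5_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # skip anything that is not an md5 result line
        name, digest = match.group("name"), match.group("digest")
        if digest in first_seen:
            messages.append(f"{name} is identical to {first_seen[digest]}")
        else:
            first_seen[digest] = name
    return messages
```

To use it on real output, pipe the `find ... | xargs -0 md5` part of the command above into a script that passes `sys.stdin` to `report_duplicates` and prints the result.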
FileMerge is a good visual diff tool that comes with the developer tools. (You need to install the devtools separately off your OS disks; they are not included by default.)
It is good for looking at differences between files, comparing code and the like. Not so much for just seeing if two files are dupes, but Googlers might want to know.
posted by yesno at 11:32 AM on November 6, 2006
zazerr: thanks for the pointer to Yep -- it's a huge improvement over how I'd been organizing my PDFs previously.
posted by myeviltwin at 1:39 PM on November 6, 2006
This thread is closed to new comments.
Good luck with the free program route though.
posted by JakeLL at 8:07 AM on November 6, 2006