Help me consolidate thousands of pictures
January 25, 2007 11:07 AM

I have a bunch of directories of pictures across multiple disks. Many images are duplicated. Additionally, I had to use some disk recovery software to rescue other images and the filenames changed. At the moment, I can't guarantee that all of the filenames are unique. I'd like to weed out the duplicates and then ultimately consolidate everything into iPhoto. I'm talking close to 20,000 pictures.

I had two backup volumes fail simultaneously. I was able to recover some pictures from drive A and some from drive B, and I don't really know what the two sets have in common. I'm sure there is LOTS of overlap.

Is there a program (hello perl wizards) that will weed out duplicates by CRC? I'm sure that through the various recovery procedures, the pictures were given different names. And I'm not quite sure the naming is unique, so I don't want to just

find . -name \*.jpg -exec mv {} mynewdirectory \;

for fear that I'll overwrite files. Plus that won't weed out the dupes. Any better ideas?
posted by neilkod to Computers & Internet (13 answers total) 8 users marked this as a favorite
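
For the overwrite worry specifically, here is a minimal sketch in Python 3 of gathering JPEGs into one directory without clobbering same-named files. The function names (unique_target, gather_jpegs) and the "-1", "-2" suffix scheme are made up for illustration, and it copies rather than moves so the originals stay put. It assumes the destination lives outside the tree being scanned, and it does not remove duplicates.

```python
import shutil
from pathlib import Path


def unique_target(dest_dir, name):
    """Return a path in dest_dir that does not exist yet, adding -1, -2, ... if needed."""
    candidate = dest_dir / name
    stem, suffix = Path(name).stem, Path(name).suffix
    counter = 1
    while candidate.exists():
        candidate = dest_dir / f"{stem}-{counter}{suffix}"
        counter += 1
    return candidate


def gather_jpegs(source_root, dest_dir):
    """Copy every .jpg/.jpeg under source_root into dest_dir without overwriting."""
    source_root, dest_dir = Path(source_root), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() lists the tree up front, so the walk order is stable.
    for path in sorted(source_root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg"}:
            target = unique_target(dest_dir, path.name)
            shutil.copy2(path, target)  # copy2 also keeps timestamps
            copied.append(target)
    return copied
```

Two different files both called img.jpg would end up as img.jpg and img-1.jpg in the destination.
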
Hopefully you'll get some good technical information/tips from people who deal with this routinely, but I've often wondered the same thing.

The best I could come up with in my brainstorming was to determine some non-obvious identifiers and possibly do this in a few steps. Filename is the obvious one, but maybe also file size and creation date.

Sort by name and identify the files that share a name. If the file size and creation date are also the same for each file, then I'd say there's strong evidence they are indeed dupes.

Then sort by file size and see how many files are exactly the same size, even if the names differ. This might require a quick verification between the two files for the rare case where the file size is identical but the pictures are in fact different.

My point is, do it in a few iterations to try to narrow down the 20,000 pics into maybe a hundred or so questionable ones that can't be automatically removed from the group.
posted by johnstein at 11:43 AM on January 25, 2007
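
The "same size, then verify" step above can be sketched roughly like this in Python 3. The function name is hypothetical; the byte-for-byte check uses the standard library's filecmp.cmp with shallow=False so that equal size alone is never taken as proof. Creation dates are ignored here, since recovery tools tend to reset them.

```python
import filecmp
from collections import defaultdict
from pathlib import Path


def duplicates_by_size_then_bytes(root):
    """Return groups of byte-identical files found under root."""
    by_size = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot be a duplicate
        remaining = same_size
        while remaining:
            first, rest = remaining[0], remaining[1:]
            # shallow=False forces a real content comparison.
            matches = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if matches:
                groups.append([first] + matches)
            remaining = [p for p in rest if p not in matches]
    return groups
```

Each returned group lists files with identical contents regardless of name; anything left out had either a unique size or different bytes.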


If the files are identical in content, but named/dated differently, use:

md5sum {file}

This will spit out an id for the file (a good CRC if you will):

d2e96c2284b9bf3bc351e27b8a2d091d *foo.jpg

If you do this for all the files on each disk to get 2 lists, then you should be able to use a bit of unix sort/diff manipulation to weed out the common ones.
posted by azlondon at 12:01 PM on January 25, 2007
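
As a sketch of the "two lists" idea, assuming Python 3 rather than md5sum plus sort/diff: hash every file on each disk, then compare the sets of digests. The names md5_of, digest_index and compare_trees are invented for this example.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large images need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_index(root):
    """Map each MD5 digest to the list of files under root with that content."""
    index = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            index.setdefault(md5_of(path), []).append(path)
    return index


def compare_trees(root_a, root_b):
    """Return (common, only_a, only_b) based on file content, not file names."""
    a, b = digest_index(root_a), digest_index(root_b)
    common = {d: (a[d], b[d]) for d in a.keys() & b.keys()}
    only_a = [p for d in sorted(a.keys() - b.keys()) for p in a[d]]
    only_b = [p for d in sorted(b.keys() - a.keys()) for p in b[d]]
    return common, only_a, only_b
```

Files in common exist on both disks under possibly different names, so one copy of each is enough; only_a and only_b are the files unique to each disk.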


There are a few cheap or freeware solutions out there that identify duplicate images based on the image content.
posted by stupidsexyFlanders at 12:01 PM on January 25, 2007


find /dir -type f | md5sum | sort | uniq -d -w32

That will calculate the md5 hash of every file in the tree, and then print the names of duplicates.
posted by Rhomboid at 12:05 PM on January 25, 2007


Oops, the above is wrong, you'll need to use xargs.

find /dir -type f -print0 | xargs -0 md5sum | sort | uniq -d -w32

If the filenames exceed the maximum command length (which is a possibility) then you'll have to do something like:

(find /dir -type f -print0 | xargs -0 md5sum) | sort | uniq -d -w32

xargs will run md5sum as many times as necessary to sum all files, and the output will all go to a pipe and then to sort. Without the parens sort would only get the input of the first iteration of md5sum.
posted by Rhomboid at 12:09 PM on January 25, 2007
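
One thing to keep in mind with the pipeline above: uniq -d prints only one line per group of repeated lines, so you see a single filename for each set of duplicates rather than all of them. A rough Python 3 sketch that groups the md5sum output so every name in a set is listed is below. It assumes GNU md5sum's line format (a 32-character digest, a space, a space or "*", then the filename); other tools such as the Mac md5 utility print a different layout by default. The function name is made up.

```python
from collections import defaultdict


def group_md5sum_lines(lines):
    """Group GNU md5sum output lines by digest; keep digests seen more than once."""
    by_digest = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if len(line) < 35:
            continue  # too short to be digest + separator + filename
        # Columns 0-31 are the digest, 32-33 the separator, the rest the name.
        digest, name = line[:32], line[34:]
        by_digest[digest].append(name)
    return {d: names for d, names in by_digest.items() if len(names) > 1}
```

Saving the sums to a file first (find /dir -type f -print0 | xargs -0 md5sum > sums.txt) and then calling group_md5sum_lines(open("sums.txt")) gives one entry per set of identical files, from which you can keep the first name and review the rest.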


I use VisiPics. It's free and works great.
posted by SampleSize at 12:32 PM on January 25, 2007


fdupes does wonders.


NAME
       fdupes - finds duplicate files in a given set of directories

SYNOPSIS
       fdupes [ options ] DIRECTORY ...

DESCRIPTION
       Searches the given path for duplicate files. Such files are found by
       comparing file sizes and MD5 signatures, followed by a byte-by-byte
       comparison.

OPTIONS
       -r --recurse
              include files residing in subdirectories

       -s --symlinks
              follow symlinked directories

       -H --hardlinks
              normally, when two or more files point to the same disk area
              they are treated as non-duplicates; this option will change this
              behavior

       -n --noempty
              exclude zero-length files from consideration

       -f --omitfirst
              omit the first file in each set of matches

       -1 --sameline
              list each set of matches on a single line

       -S --size
              show size of duplicate files

       -q --quiet
              hide progress indicator

       -d --delete
              prompt user for files to preserve, deleting all others (see
              CAVEATS below)

       -v --version
              display fdupes version

       -h --help
              displays help


posted by zengargoyle at 12:42 PM on January 25, 2007


On Windows, I've had good luck with DuFF and D'peg!.

The latter is specifically designed to find duplicate photos, and will even find rotated, flipped, resized, and watermarked versions of otherwise identical pictures.

You probably don't need that functionality yet, but it's handy to have. In checksum mode, both programs are very fast, and D'peg! is more stable and more mature. DuFF has better options for dealing with the dupes, such as saving the list of dupes, moving them to another part of the tree, etc.
posted by Myself at 1:29 PM on January 25, 2007


If the pictures are duplicates, but one has been edited or resized, MD5 hashes won't find the dupes. A program called DupDetector will go through and compare the actual content of the pictures. It requires some manual intervention, but it worked well for me.
posted by chrisamiller at 2:34 PM on January 25, 2007
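
To give a feel for how content-based matching can work, here is a minimal sketch of one common technique, an "average hash". This is not a description of how any of the tools named in this thread work internally. It assumes the image has already been decoded, converted to grayscale and shrunk to a small grid (8x8 here) by some imaging library; that step is left out, and both function names are hypothetical.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set if brighter than the mean.

    pixels is a list of rows of 0-255 values, e.g. an image shrunk to 8x8.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; a small distance suggests similar-looking images."""
    return bin(hash_a ^ hash_b).count("1")
```

Because every pixel is compared against the image's own mean, a uniformly brightened copy hashes the same as the original, which a checksum like MD5 would never do. Where to draw the line for "similar enough" is a judgment call, which is presumably why tools of this kind ask for manual review.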


I used CloneSpy (free) for this exact purpose and it worked perfectly. It's not just for pictures, either.
posted by desjardins at 4:14 PM on January 25, 2007


Rhomboid: Actually you don't need the parens. Xargs will run multiple invocations of md5sum if it needs to, but they'll all go to the same output pipe --- which is the input to sort. So your first command line should work fine.

Actually, it sounds like the poster is on a Mac, which has an 'md5' utility instead of an 'md5sum' utility (works basically the same though).
posted by hattifattener at 11:13 PM on January 25, 2007


In some cases, iPhoto lets you know when you're about to import a duplicate photo. I suggest you find a couple sets of differently-named dupes and test this out before you get medieval on the filesystem.
posted by sudama at 6:37 AM on January 26, 2007


Ok, I've actually done exactly this many months ago (or at least got a significant amount done). Caveat: I was only interested in my own personal digital pictures and not random stuff I downloaded. But here's what I ended up doing:

1) Find where all of my pictures are and put them under a directory tree. To remove the possibility of overwrites, I just made a new directory for every chunk of pics I found. You can use any number of tools to find pics. Doesn't matter how ugly the tree is.
2) Since the pics I was interested in sorting were from digital cameras, they contained EXIF information. Using a program like exifer, I collapsed the tree and then batch-renamed every single digital pic to the date-time it was taken (from the EXIF field). I recommend the filename format "yyyy-mm-dd-hh-mm-ss". This makes the files sort correctly, and it has other benefits too.
3) So now all pictures are named by date-time, and that is detailed enough to expose duplicates: two files with the same filename down to the second are highly likely to be the same picture. Now I re-organize all the pictures into a new tree by year, which is easy to do at this point. A shell script can do this for you if you wish, but with the new naming convention, using the file manager was not a problem.
4) I was now in a position to go through the pictures year by year and clump them into events. I chose to keep them sorted by year and merely batch-add metadata to the images themselves (in IPTC or preferably XMP fields). I can imagine some will want to move them to subdirectories based on event, and some will even want to batch-rename groups of pictures. I've determined that it's cleaner to just keep the pictures sorted by year and named by date-time because it's consistent. I have an incoming pics folder, and I now batch-rename everything that comes in and put it in its appropriate place.

Some comments:
* I for one believe that the metadata should be encoded in the picture itself and not in some database on your computer or even in the directory tree. Pictures can move around a lot, and that info will get lost unless it is kept within the file. EXIF is a great start but lacking. IPTC lets you put in more meaningful information and is very common, though it is tailored for news sources (that doesn't matter). I've read that XMP is the wave of the future, and after perusing it I agree it should be, but I've had a hard time finding good (and free/cheap) picture organizers that support it properly. I REALLY hope this changes.

* Filenames for pictures can be a pain: cameras tend to give nearly meaningless names like "IMG04932" or "DSC2342349", so it's always a good idea to rename them. However, if you try to come up with something short but meaningful yourself, then as you accumulate more and more pictures you eventually end up with names that are nearly as meaningless across the whole collection. This is why I decided that date-time naming was the way to go. It's simple, stress-free, unique (or nearly so), self-organized, and canonical.

* Links to some photo organizers.. I added a couple near the bottom.

* exifer is a Windows program. I've had trouble finding a suitable replacement on Linux, but there is stuff out there. There are also many explicit picture-dupe finders, as mentioned by previous commenters.

* Sadly, this method doesn't apply to my scanned images. I'm working on a system for that too and will get back to it at some point.

Hope this helps.
posted by mikshir at 10:53 AM on January 26, 2007
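
The rename-by-date and sort-by-year steps above can be sketched in Python 3 along these lines. Reading the EXIF field out of the file is not shown and would need an EXIF-capable library; the sketch starts from the timestamp string, assuming the usual EXIF layout "YYYY:MM:DD HH:MM:SS". The function name dated_target and the "-1", "-2" suffix for same-second collisions are made up for illustration.

```python
from datetime import datetime
from pathlib import Path


def dated_target(exif_datetime, suffix, library_root):
    """Build <library_root>/<year>/<yyyy-mm-dd-hh-mm-ss><suffix> without clobbering.

    exif_datetime is the EXIF timestamp string, e.g. "2006:07:04 18:30:05".
    """
    taken = datetime.strptime(exif_datetime, "%Y:%m:%d %H:%M:%S")
    year_dir = Path(library_root) / f"{taken:%Y}"
    stem = f"{taken:%Y-%m-%d-%H-%M-%S}"
    candidate = year_dir / f"{stem}{suffix.lower()}"
    counter = 1
    # Same-second names are likely dupes, but keep both so they can be reviewed.
    while candidate.exists():
        candidate = year_dir / f"{stem}-{counter}{suffix.lower()}"
        counter += 1
    return candidate
```

Colliding files end up next to each other when sorted by name, which makes the likely duplicates (or burst shots taken within the same second) easy to eyeball before deleting anything.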

