I'm Drowning in Images
April 25, 2012 6:16 PM

I have a couple hundred thousand digital photos. I have a bunch of duplicate images from overlapping backups. I have inconsistent naming and folder schemes. I have edited and cropped versions of the same image. I need some software to help me clean and sort my images.

I need one or many pieces of software that can:
  • De-duplicate Images -- I would prefer something simple that shows two images with dimensions & metadata then asks me if they are the same.
  • Add tags to the images -- Geo, face, and notes about each of the images. I need the tags to be written back into the images, so that in a few years other software can read the tags.
  • Image Versioning -- I want to be able to associate an edited version or a crop with the original image.
  • Web Browsable -- I want to be able to share all of my images online and let people view and sort based on tags, date, etc.
I will pay hundreds of dollars for this software, but most of the free-ish software that I've seen doesn't do this (Picasa, iPhoto, Flickr, etc.). On the other end of the spectrum there are corporate-grade digital asset management tools that run in the $10,000+ range, which I can't afford.

I know this is a pretty specific feature set, and I want to make sure it doesn't exist before I look into writing it myself.
posted by gregr to Computers & Internet (8 answers total) 41 users marked this as a favorite
 
Easiest part - sort by md5sum, print duplicates. For similar images, maybe you sort by some function of the color space?
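
A minimal sketch of the checksum idea, assuming the photos sit under a single folder tree. The function names `file_md5` and `find_exact_duplicates` are made up for illustration, and only the Python standard library is used. It catches byte-for-byte copies only; resized or re-saved images will hash differently.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Map each MD5 digest to the sorted files under root that share it.

    Only digests shared by two or more files are returned.
    """
    by_digest = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_digest[file_md5(path)].append(path)
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_exact_duplicates` on the top-level photo folder and printing each group gives the "print duplicates" step; deciding which copy to keep is still up to you.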
posted by gregglind at 6:51 PM on April 25, 2012


Response by poster: Doing an MD5 of the files should be pretty easy & catch a lot of the duplicates. For similar images I've looked into some kind of histogram comparison, but that looks like it might get a little involved.
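
For what it's worth, a rough sketch of one kind of histogram comparison (histogram intersection over coarse RGB bins). This is only one of several possible measures and an assumption about what "histogram comparison" would mean here. Decoding the image files into `(r, g, b)` pixel tuples is left to whatever imaging library you use; `color_histogram` and `histogram_intersection` are hypothetical names.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized histogram of (r, g, b) pixels with values in 0..255."""
    counts = [0] * (bins_per_channel ** 3)
    total = 0
    for r, g, b in pixels:
        # Map each channel value 0..255 onto a bin index 0..bins_per_channel-1.
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        counts[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
        total += 1
    if total == 0:
        raise ValueError("no pixels given")
    return [count / total for count in counts]


def histogram_intersection(hist_a, hist_b):
    """Return the overlap of two normalized histograms, from 0.0 to 1.0."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))
```

Because the histograms are normalized, a resized copy should score close to 1.0 against its original. The catch is that a histogram ignores where the colors are, so two unrelated photos with similar palettes can also score high, and a tight crop can score low. It is probably best used to shortlist candidates for a manual side-by-side check.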
posted by gregr at 7:26 PM on April 25, 2012


I spent several days writing custom AppleScripts to accomplish this with Aperture. It's true: there is no commercial software that delivers decent results.
posted by mmdei at 7:59 PM on April 25, 2012


I suspect that identifying which images are resized versions of each other, or otherwise similar, is a fairly hard problem. This thread has some suggestions; perceptual hashing appears to be another approach. The Scale-Invariant Feature Transform (SIFT) may be yet another; it's implemented in the FIJI distribution of ImageJ, among other places, and I've used it a bit for image registration.
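
To make the perceptual hashing idea concrete, here is a sketch of an "average hash", one of the simplest members of that family (not necessarily the variant any particular tool uses). It assumes the image has already been decoded into a 2D list of grayscale values; `shrink`, `average_hash` and `hamming_distance` are illustrative names.

```python
def shrink(gray, size=8):
    """Block-average a 2D grayscale grid down to size x size cells."""
    height, width = len(gray), len(gray[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")
    small = []
    for i in range(size):
        top, bottom = i * height // size, (i + 1) * height // size
        row = []
        for j in range(size):
            left, right = j * width // size, (j + 1) * width // size
            block = [gray[y][x] for y in range(top, bottom) for x in range(left, right)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(gray, size=8):
    """Return a size*size-bit integer: one bit per cell, set if the cell is brighter than the mean."""
    cells = [value for row in shrink(gray, size) for value in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")
```

Resized copies of the same picture should land within a few bits of each other, so a small Hamming distance flags a pair for review. Where to set the cutoff is something you would have to tune on your own photos, and crops will generally not match.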
posted by pombe at 10:52 PM on April 25, 2012 [1 favorite]


These utilities need all the files on the same drive. I've used both and recommend them:

To get rid of exact duplicate files, even if the filename has been changed, use the free DoubleKiller. Do this first. This program is not particularly image oriented; it checks all files by a computed checksum.

To de-duplicate images which may be different sizes/resolutions or resaved JPGs, in other words the same image in different formats, use the free VisiPics. This can help somewhat with versioning too. It can be very slow, which is understandable given what it does.

Tagging and web albums will need something else.
posted by caclwmr4 at 12:04 AM on April 26, 2012 [3 favorites]


If you are using a Mac, have a look at DupeZap for finding duplicates.
posted by conrad53 at 7:25 AM on April 26, 2012


I've done duplicate image detection using Fourier cross-correlation. You'd want to do it on a small subsample of the images (like 128x128). It's simple enough that you can implement it in Python with NumPy (the successor to the Numeric library) in a couple of hours. It would still take quite a while to compare hundreds of thousands of images though, since every pair of images needs its own FFT-based comparison and the number of pairs grows quadratically.
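
A minimal sketch of what that might look like, assuming NumPy stands in for Numeric and that both images have already been loaded as equal-size grayscale arrays (for example 128x128 thumbnails). The function name is made up. Note that the product in the frequency domain is element-wise, so the cost per pair is dominated by the FFTs.

```python
import numpy as np


def normalized_cross_correlation_peak(a, b):
    """Return the peak of the circular cross-correlation of two grayscale arrays.

    Both inputs are mean-subtracted and the result is divided by the product
    of their norms, so the peak is at most 1.0 and equals 1.0 for identical
    (or circularly shifted) images.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("arrays must have the same shape")
    a = a - a.mean()
    b = b - b.mean()
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return 0.0  # at least one image is completely flat
    # Correlation theorem: correlation = IFFT(FFT(a) * conj(FFT(b)))
    spectrum = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    correlation = np.fft.ifft2(spectrum).real
    return float(correlation.max() / norm)
```

A peak near 1.0 suggests the two thumbnails show the same content, possibly shifted; unrelated images should score much lower. The threshold is something to tune on real photos, and caching each image's FFT would avoid recomputing it for every pair.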
posted by miyabo at 12:42 PM on April 26, 2012


(That method would be great for detecting cropped and resized images, but it wouldn't handle rotations at all.)
posted by miyabo at 12:43 PM on April 26, 2012

