Software for finding picture duplicates
June 25, 2011 7:06 AM
Are there any Windows programs that can compare two image files and tell me if they're duplicates or not, ignoring different metadata and file formats?
I'm thinking of something similar to foobar2000's bit-comparison for audio files, where if you compare a flac file against a wav file it will tell you whether or not they output exactly the same data when played. I want this feature but for pixels instead of audio samples.
Bonus points for freeware options; extra bonus points if the program can verify whole directories and identify all duplicates inside them.
I've used Duplicate File Detective with great success to trim duplicates from my office's file storage server. It can generate a hash for the file based on factors you give it, and IIRC it will find duplicated images.
posted by msbutah at 7:24 AM on June 25, 2011 [1 favorite]
Response by poster: burnmp3s: I can decompress music.mp3 (lossy) to music.wav (theoretically lossless and much bigger filesize, but since it came from the mp3 the audio data is actually lossy) and foobar will correctly report both as the same.
Likewise, if I open pic.jpeg in MSPaint and save it as pic.bmp, they both output the same image when opened (at least that's my understanding of it, the pixels displayed in my screen will be identical for both files).
On preview: msbutah, the program in your link does to audio files the opposite of what I intend to do to image files: it analyses metadata instead of ignoring it. It can also compare whole files by checksumming, but that would report my pic.jpeg and pic.bmp files as different when they contain the same image.
posted by Bangaioh at 7:38 AM on June 25, 2011
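A small sketch of the distinction drawn above between checksumming whole files and comparing decoded pixels. It is not what any program mentioned in this thread does. As stand-ins for pic.jpeg and pic.bmp it uses two tiny hand-made PPM files (one ASCII with a comment line, one binary), and the helper names `read_header` and `decode_ppm` are made up for illustration. The parser only covers a subset of the format (P3 and P6 with 8-bit samples).

```python
import hashlib


def read_header(data: bytes):
    """Return (magic, width, height, maxval, offset of the pixel data)."""
    tokens, pos = [], 0
    while len(tokens) < 4:
        while data[pos:pos + 1].isspace():      # skip whitespace between tokens
            pos += 1
        if data[pos:pos + 1] == b"#":           # comment: ignore up to end of line
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    magic, width, height, maxval = tokens
    # Exactly one whitespace byte separates the header from the pixel data.
    return magic, int(width), int(height), int(maxval), pos + 1


def decode_ppm(data: bytes):
    """Decode a P3 (ASCII) or P6 (binary) PPM into (width, height, RGB bytes)."""
    magic, width, height, maxval, offset = read_header(data)
    if maxval > 255:
        raise ValueError("only 8-bit samples are handled in this sketch")
    if magic == b"P6":
        pixels = data[offset:offset + width * height * 3]
    elif magic == b"P3":
        pixels = bytes(int(token) for token in data[offset:].split())
    else:
        raise ValueError("not a PPM file")
    return width, height, pixels


# Same two pixels (red, blue), stored in two different ways.
ascii_file = b"P3\n# saved by some editor\n2 1\n255\n255 0 0  0 0 255\n"
binary_file = b"P6\n2 1\n255\n" + bytes([255, 0, 0, 0, 0, 255])

same_bytes = hashlib.md5(ascii_file).digest() == hashlib.md5(binary_file).digest()
same_pixels = decode_ppm(ascii_file) == decode_ppm(binary_file)
print(same_bytes, same_pixels)  # prints: False True
```

A whole-file checksum sees the comment and the different encoding, so it calls the files different; the decoded comparison sees only width, height, and samples.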
Bangaioh: I'm afraid you are wrong there.
1st: Foobar is doing a close enough match. Music players have to do this all the time because of differences in bitrate, encoder and the exact amount of space left around the music.
The problem is your claim about pic.bmp and pic.jpg. You can try it: Find a png, a lossless image. Open up MSPaint and resave it as pic.jpg, then zoom in until you can see it pixel by pixel. You'll be able to see the differences right away. The JPEG blurs the lines between colours and washes out small differences in colour to save space. Basically, it is much easier to say 'Row of 14 red pixels' than '5 red, 4 light red, then 5 more red'. Therefore your algorithm will have to convert everything to bmp and do a close enough match.
I've never heard of such a program. It would take an unholy amount of time to run, though it is in theory possible. If you find anything, let me know.
I have heard of programs that attempt to match images by color/shapes and such, but you would have to hand verify each match as they tend to be very inaccurate. Also I've only ever seen them on websites.
posted by Canageek at 7:51 AM on June 25, 2011
Canageek, the OP is talking about comparing a BMP saved from a JPG. So the decoded BMP and the original JPG should be identical.
posted by Gyan at 8:22 AM on June 25, 2011
I used a program called ODIN II to do this around the year 2000. The people who made it aren't around any more, but you can still get it off of freeware file sites. It actually still works OK (Windows Vista), which I think is pretty amazing. The default setting works really well for color (not so well for grayscale). Start by opening the help file to Getting Started, since the interface is non-obvious. It can take a while if you have a lot of images, although it is not as slow as it was back then (thank you, Moore).
Things which don't work: No start menu shortcut (probably because the location of the start menu has moved), and the file types supported are very year 2000--no RAW.
Something which was not obvious to me 11 years ago: The person who made ODIN was trying to remove duplicates from porn downloaded off of Usenet. The examples in the help file include women in short skirts. The "Getting Started" page in the help file has no images.
posted by anaelith at 8:25 AM on June 25, 2011
Response by poster: You're right, Canageek, I've just tested it and music.mp3 and music.wav are indeed different, and foobar reports them as different files. I would have sworn I had tested this before but as it turns out I was talking out of my arse. Sorry, everyone!
> Find a png, a lossless image. Open up MSPaint and resave it as pic.jpg
That's backwards, I'd be losing information when saving as jpg, so of course pic.jpg would be different from pic.png.
> Therefore your algorithm will have to convert everything to bmp and do a close enough match.
Not close enough, I want exact matching. It would convert everything to bmp and compare pixel by pixel.
I know it is possible for jpegs with different tags (really, just tested it!), but I don't know if for other file types I'd encounter something similar to the mp3>wav situation, where the conversion to bmp would necessarily result in different pixels.
I will now look at anaelith's suggestion.
posted by Bangaioh at 8:37 AM on June 25, 2011
Response by poster: Unfortunately I couldn't get ODIN II to work, it always gives out errors when I try to "Gather Image Data", and "Start Narration" doesn't detect anything despite the directory being analysed containing loads of duplicates.
However, after reading the help file it seems to work by either "close enough" matching, or filesize and/or dimension comparison, none of which is what I'm looking for.
posted by Bangaioh at 9:40 AM on June 25, 2011
Best answer: You could probably use ImageMagick to do it. Convert bmp, jpg, tiff or whatever to some common lossless format and then compare bytes. Some Unix and scripting knowledge would be necessary for automating the process.
posted by DarkForest at 10:13 AM on June 25, 2011
I've used ImageDupeless. It's not great but it is a useful tool.
http://www.imagedupeless.com/en/index.html
posted by lemniskate at 1:20 PM on June 25, 2011
Best answer: 1) install cygwin and netpbm
2) write a script to take a hash of every image using netpbm commands, something along the lines of for i in *.jpg; do echo "$i" $(anytopnm "$i" | md5sum); done
3) compare this output for duplicates using tool of your choice (maybe a spreadsheet)
Admittedly learning cygwin and netpbm may take you a day or so, but they're really freaking useful for other things so it's worth it.
posted by miyabo at 9:11 PM on June 25, 2011
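Step 3 above could also be done with a few lines of Python instead of a spreadsheet. This is only a sketch: it assumes each line of the script's output looks like "filename digest -" (which is what echo plus md5sum reading from a pipe produces), the function name `group_duplicates` is made up, and the digests in the sample are invented for illustration.

```python
from collections import defaultdict


def group_duplicates(lines):
    """Group filenames by digest; return only groups with more than one file."""
    by_digest = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so filenames containing spaces stay intact.
        name, digest, _dash = line.rsplit(None, 2)
        by_digest[digest].append(name)
    return sorted(sorted(names) for names in by_digest.values() if len(names) > 1)


# Invented sample output; real digests would come from md5sum.
sample = [
    "beach.jpg 0123456789abcdef0123456789abcdef -",
    "holiday photo.jpg 0123456789abcdef0123456789abcdef -",
    "cat.jpg fedcba9876543210fedcba9876543210 -",
]
print(group_duplicates(sample))  # prints: [['beach.jpg', 'holiday photo.jpg']]
```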
Best answer: What DarkForest said. You can use ImageMagick to convert both files to a raw RGB format--nothing but the pixel data, not even a header with the size or anything--which will get rid of the metadata quite nicely. Then you just diff the files.
posted by equalpants at 2:26 AM on June 26, 2011
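One caveat with headerless dumps, sketched below: because the raw file carries no width or height, a 2x1 image and a 1x2 image with the same samples produce byte-identical dumps. The sketch assumes the raw RGB files already exist and that the dimensions were obtained separately; `raw_dumps_match` is a made-up helper name, and the two temporary files are stand-ins written by hand.

```python
import filecmp
import os
import tempfile


def raw_dumps_match(path_a, size_a, path_b, size_b):
    """True only if the (width, height) pairs and every byte of the dumps agree."""
    if size_a != size_b:
        return False
    # shallow=False forces a byte-by-byte comparison of file contents.
    return filecmp.cmp(path_a, path_b, shallow=False)


with tempfile.TemporaryDirectory() as tmp:
    wide = os.path.join(tmp, "wide.rgb")  # pretend this came from a 2x1 image
    tall = os.path.join(tmp, "tall.rgb")  # pretend this came from a 1x2 image
    for path in (wide, tall):
        with open(path, "wb") as handle:
            handle.write(bytes([255, 0, 0, 0, 0, 255]))  # red pixel, blue pixel
    bytes_only = filecmp.cmp(wide, tall, shallow=False)
    with_sizes = raw_dumps_match(wide, (2, 1), tall, (1, 2))
    identical = raw_dumps_match(wide, (2, 1), wide, (2, 1))

print(bytes_only, with_sizes, identical)  # prints: True False True
```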
Response by poster: ImageDupeless is almost there but not quite: with difference level set to 0% it doesn't detect any duplicates at all (even when comparing files which are exactly the same, only differing in filename) and when set to 1% or greater it gives out false positives. With my very limited testing, it seems nice for fuzzy matching but that's not what I want.
Scripting is out of my league and something I'd like to avoid at all costs but for now it seems my only option, I'll look into imagemagick and netpbm in the future. I don't need a completely automated process, just something that can recurse through a directory converting every jpeg to a raw RGB file with the same filename like equalpants prescribed and I can pick it up from there.
posted by Bangaioh at 4:10 AM on June 26, 2011
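A rough sketch of the "recurse through a directory" idea described above. Instead of writing a raw RGB file next to every image, it hashes the decoded pixels in memory and groups files whose pixels match exactly. Nothing here comes from the thread: `find_exact_duplicates` and `decode_with_pillow` are made-up names, the extension list is an arbitrary choice, and the default decoder assumes the third-party Pillow library is installed. Converting to RGB also discards any alpha channel, so images differing only in transparency would be reported as equal.

```python
import hashlib
import os
from collections import defaultdict


def decode_with_pillow(path):
    """Decode an image to (width, height, RGB bytes). Assumes Pillow is installed."""
    from PIL import Image  # third-party; imported lazily so the sketch loads without it
    with Image.open(path) as image:
        rgb = image.convert("RGB")
        return rgb.width, rgb.height, rgb.tobytes()


def find_exact_duplicates(root, decode=decode_with_pillow,
                          extensions=(".jpg", ".jpeg", ".png", ".bmp")):
    """Walk root recursively and group files whose decoded pixels are identical."""
    groups = defaultdict(list)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(folder, name)
            width, height, pixels = decode(path)
            # Include the dimensions so differently shaped images never collide.
            digest = hashlib.sha256(f"{width}x{height}:".encode() + pixels)
            groups[digest.hexdigest()].append(path)
    return sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)
```

Calling `find_exact_duplicates` on a folder would return a list of groups, each group being the paths of files that decode to the same pixels; metadata and file format never enter the hash.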
Ah, sorry, I got it backwards.
I now remember hearing about this when I listened to the podcast 'Cyberspeak' on computer forensics. They used a fuzzy logic search tool when analyzing suspects' computers for child pornography images or such, in case the image had the first couple of bytes or such deleted, which would totally change the hashsums of a normal hash, but not a fuzzy hash. You could try starting there, and if you can't find what you are looking for, email them. They may have heard of something.
posted by Canageek at 1:37 PM on July 3, 2011
Note that while flac and wav are both containers for lossless audio, jpeg uses lossy compression. So you would not really be able to compare a jpeg versus a gif or a different jpeg from the same source exactly; the program would have to use some sort of algorithm to figure out if they were "close enough".
posted by burnmp3s at 7:17 AM on June 25, 2011