7TB of photos, want a Mac app to eliminate EXACT duplicates only
December 30, 2020 1:30 AM   Subscribe

I am trying to de-dupe multiple terabytes of photos. I want a Mac app that will find all of the duplicate images (and only duplicates, not "similar" ones).

So, I have copied 7TB of folders of image files onto a 5TB and a 2TB hard drive, and am now faced with the task of finding and getting rid of all of the duplicate images.

(All of these images come from hundreds of folders from three different Macs, several thumb drives, and about ten various backup hard drives, so I have no doubt that terabytes of these images are duplicates.)

I have been thinking that I would first search for all default-named photos (with names like DSC_000x.jpg or IMG_000x.jpg) and use the app A Better Finder Rename to prefix them with the EXIF date and time they were taken, so I would end up with JPEGs with names like 2020-12-25 18.30.00 - DSC_0001.jpg. I feel like this would help me in the future if I wanted to sort them into folders by month, and it would also make it easier to keep track of uploading them to Flickr (which tends to do better with smaller upload batches).
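(For anyone who thinks in code, the renaming scheme I have in mind would look roughly like this Python sketch; the folder path, the filename pattern, and the use of the Pillow library to read EXIF are illustrative assumptions, not a tested tool.)

    # Sketch: prefix default-named JPEGs with their EXIF date and time.
    # Assumes Pillow (pip install Pillow); the folder path is a placeholder.
    import os
    import re
    from PIL import Image

    FOLDER = "/Volumes/Photos"  # hypothetical location
    DEFAULT_NAME = re.compile(r"^(DSC|IMG)_\d+\.jpe?g$", re.IGNORECASE)

    for name in os.listdir(FOLDER):
        if not DEFAULT_NAME.match(name):
            continue
        path = os.path.join(FOLDER, name)
        with Image.open(path) as img:
            taken = img.getexif().get(306)  # tag 306 = DateTime, e.g. "2020:12:25 18:30:00"
        if not taken:
            continue  # no EXIF date; leave the file alone
        date, time = taken.split(" ", 1)
        prefix = date.replace(":", "-") + " " + time.replace(":", ".")
        # e.g. DSC_0001.jpg -> "2020-12-25 18.30.00 - DSC_0001.jpg"
        os.rename(path, os.path.join(FOLDER, f"{prefix} - {name}"))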

Anyway...

Can anyone recommend a Mac app that they have used that has been rock-solid at finding duplicate photos?

As a teacher, I work with fast-moving young children, so a lot of my photos are taken a hundred at a time in "burst" mode, where I hold down my DSLR's shutter and capture several frames per second. For this reason,

I want an app that looks for EXACT duplicates only, not near-identical images taken 0.2 seconds apart.

Also, I need the app to be smart enough to identify duplicates even if they have totally different names

[such as Maggie’s First Lost Tooth.jpg and 2015-11-15 13.15.26 - IMG_8675.jpg being the same photo]

Finally, I'd like an app that you can just point at a hard drive or folder, not one where I have to import all of the photos into some sort of database. I like being able to see and have access to all of my files in The Finder (and I don't really trust iPhoto (now Photos) after it erased all of my photos about a decade ago).

Suggestions?
posted by blueberry to Computers & Internet (22 answers total) 16 users marked this as a favorite
 
If you don't mind using a CLI and Homebrew, I'd recommend rdfind. It has a nifty method for minimising the number of files that actually need a full comparison, which helps with speed on large file sets.
posted by pompomtom at 2:08 AM on December 30, 2020 [1 favorite]


(I note the GitHub page mentions MacPorts, but I'm pretty sure brew install rdfind will work to install it.)
posted by pompomtom at 2:22 AM on December 30, 2020


Response by poster: Thanks pompomtom, but I should add that as a visual learner, I absolutely require an app with a graphical user interface.
(no command line stuff where actions happen out of sight, or where I might mistakenly type the wrong character and end up deleting everything)
posted by blueberry at 2:50 AM on December 30, 2020


no command line stuff where actions happen out of sight,

Just running it will generate a list of possible dupes that you can then go through and evaluate.

rdfind [options] directory_or_file_1 [directory_or_file_2] [directory_or_file_3] ...
Without options, a results file will be created in the current directory.


When I had to do something similar, but with files of any type (compressed archives, PDFs, images, documents and more; file dates could have been changed, so they could not be relied upon), I crafted a workflow comparable to what rdfind does, only in separate steps (sketched below):
- create a list of file names and their sizes, sorted by size.
- scrub the list, removing all entries that have a unique size.
- generate md5sum values for all the remaining items, sorted by this value.
- items listed with the same md5sum may be identical and are slated for a byte-for-byte comparison.
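In Python, a minimal sketch of those steps might look like this (the root path is a placeholder, and matches are only reported, never deleted):

    # Sketch of the size-first, checksum-second dedupe pass described above.
    # Standard library only; the root path is hypothetical.
    import hashlib
    import os
    from collections import defaultdict

    ROOT = "/Volumes/Photos"  # hypothetical

    # Steps 1-2: group files by size; a file with a unique size has no duplicate.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(ROOT):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Steps 3-4: checksum only the files whose size is shared, then group again.
    by_sum = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size, skip the expensive read
        for path in paths:
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_sum[(size, h.hexdigest())].append(path)

    for paths in by_sum.values():
        if len(paths) > 1:
            print("same size and md5, compare byte-for-byte:", paths)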
posted by Stoneshop at 3:53 AM on December 30, 2020 [3 favorites]


Response by poster: Please (!), no more suggestions that require the command line or typing text commands.

Again, I absolutely require an app with a graphical user interface. If you can’t help in the manner requested, please refrain from suggesting unwanted alternatives.
posted by blueberry at 4:17 AM on December 30, 2020 [1 favorite]


Have you tried Gemini? Here's a review of it. I've used it a bit before, with success.

Also there is the shockingly named Duplicate Photo Finder...

Here are some more to check out.
posted by rambling wanderlust at 4:49 AM on December 30, 2020 [2 favorites]


I use DupeGuru for this exact use case. Gemini has a nicer user interface, but DupeGuru is free and kinda no-frills. It can be set to "find the exact duplicate files, ignoring the filename, and just keep one of them," or it can be set to dig into the photographic data to find close-ish matches (like photos that are resized versions of each other, or photos that differ only by small increments, like the "burst" photos you mentioned), in case you want to take a second pass on your files another time to do that sort of cleanup. It's very fast; I haven't used it on as many files as you are dealing with, but I have certainly done 500-750GB in a pass with no issues.

I've also used Power Photos in situations where some or all of the photos are in iPhoto/Photos libraries (other tools can of course dig inside the folder structure libraries but Power Photos is "smarter" about dealing with originals and edits).

If you need a tool to reorganize the files once you're done, I like Big Mean Folder Machine from the makers of A Better Finder Rename which you've already mentioned.
posted by bcwinters at 6:29 AM on December 30, 2020 [7 favorites]


I’ve used Gemini (recommended above). It worked well.
posted by caek at 7:52 AM on December 30, 2020 [1 favorite]


CCleaner on Windows has a robust duplicate finder. Presumably it has similar functionality on Mac...
posted by jmfitch at 7:58 AM on December 30, 2020


Maybe QuickHash can help.

The idea is that you have a directory/folder of images, and you calculate hashes for each of them. For this purpose, a hash can be thought of as a fingerprint that you can use to identify duplicate files. Regardless of the filenames of two or more files, so long as those files contain the exact same data, they will generate the same hash or fingerprint. You can use that property to spot duplicates and filter them.

So, basically, run this tool on your folder(s) of images. Take the listing of files and their hashes and bring it into a spreadsheet application for the Mac, such as Excel or the free Numbers app that Apple distributes through its App Store. Sort the spreadsheet on the column of hashes; any files with the same hash are duplicates, and you can then pick the one you want to keep.
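For anyone following along who doesn't mind a short script, producing that listing as a CSV can be sketched like so (the root path is a placeholder, and SHA-256 here just stands in for whichever hash algorithm the tool is set to use):

    # Sketch: write a "hash,path" listing to a CSV for spreadsheet sorting.
    # Standard library only; the root path is hypothetical.
    import csv
    import hashlib
    import os

    ROOT = "/Volumes/Photos"  # hypothetical
    with open("hashes.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["hash", "path"])
        for dirpath, _, names in os.walk(ROOT):
            for name in names:
                path = os.path.join(dirpath, name)
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                writer.writerow([h.hexdigest(), path])
    # Sort the sheet on the hash column; duplicates land on adjacent rows.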
posted by They sucked his brains out! at 9:09 AM on December 30, 2020


In any case, speed could be an issue when hashing 7 TB of images. The QuickHash tool supports xxHash, a fast hash function that, from recent comparisons I've seen against SHA-1 and MD5, is probably the fastest of the common hashing functions out there, or at least among the best performing, speed-wise.

I don't know how DupeGuru and other tools are written, but I imagine that they use similar comparisons internally in order to identify dupes, so taking a closer look at that detail may be useful.
posted by They sucked his brains out! at 10:49 AM on December 30, 2020


I use DupeGuru for this exact use case.

I've also used it (albeit on Windows) to solve this problem. It has a GUI and didn't require much fiddling to get the result I wanted.
posted by Urtylug at 11:17 AM on December 30, 2020 [1 favorite]


PhotoSweeper will do this, I believe. I've only used it on the "these are super-duper similar" setting, but it has an option for "Duplicate Files" as well. I tried it just now and it looks like it does what you want.
posted by The corpse in the library at 11:24 AM on December 30, 2020 [1 favorite]


> Finally, I'd like an app that you can just point at a hard drive or folder, not one where I have to import all of the photos into some sort of database.

Oh, sorry, it might not do that the way you want it to. It's worth downloading the trial version, though.
posted by The corpse in the library at 11:25 AM on December 30, 2020


DupeGuru has worked well for me on Windows. There's a "fast" mode that just compares file info and a "slow" mode that actually inspects the contents. For photos, fast mode is probably good enough.
posted by neckro23 at 12:41 PM on December 30, 2020


Use fdupes - 1, 2
posted by GiveUpNed at 5:41 PM on December 30, 2020


I did not find Gemini to be very good when attempting a similar task, fwiw. I’ll be checking out some of the options listed above.
posted by bluloo at 11:47 PM on December 30, 2020


Response by poster: bluloo, what problems did you have with Gemini? That it didn't find all of the duplicates? Or that it falsely tagged some non-duplicates as duplicates? Or something else?
posted by blueberry at 2:21 AM on December 31, 2020


I found it wasn’t flagging all of the duplicates. And these were exact duplicates, as I had merged two libraries with significant overlap.
posted by bluloo at 9:45 AM on December 31, 2020 [1 favorite]


I think this will do what you want: Gemini. It will find both exact duplicates and similar photos, but you can choose to delete only the exact matches.

I've used it on smaller collections of photos than yours.
posted by vegetableagony at 10:20 AM on January 3, 2021 [1 favorite]


(OT) They sucked his brains out! :

Take the listing of files and their hashes and bring it into a spreadsheet application for the Mac, such as Excel..

...

In any case, speed could potentially be an issue with hashing 7 TB of images.

...

I imagine that they use similar comparisons internally


Better to sort by file size first. Different sizes won't be duplicate files.
Then sort by the first n (let's say 1024) bytes. Different first-n bytes won't be duplicate files.
Then sort by the last n bytes. Different last-n bytes won't be duplicate files.
Hashing is expensive and should be the second-to-last stage, followed at most by a final byte-for-byte comparison.
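A rough Python sketch of that cascade (the path, the 1 KiB chunk size, and the choice of SHA-256 are all illustrative):

    # Cheap checks first: group by size, then by the first and last 1 KiB,
    # and run the expensive full hash only on what survives each stage.
    import hashlib
    import os
    from collections import defaultdict

    def refine(groups, key_fn):
        """Split each candidate group by key_fn and drop the singletons."""
        refined = []
        for group in groups:
            buckets = defaultdict(list)
            for path in group:
                buckets[key_fn(path)].append(path)
            refined.extend(g for g in buckets.values() if len(g) > 1)
        return refined

    def chunk(path, offset):
        """Read 1 KiB from the front (offset >= 0) or the back (offset < 0)."""
        with open(path, "rb") as f:
            f.seek(offset if offset >= 0 else max(0, os.path.getsize(path) + offset))
            return f.read(1024)

    def full_hash(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.digest()

    ROOT = "/Volumes/Photos"  # hypothetical
    groups = [[os.path.join(d, n) for d, _, names in os.walk(ROOT) for n in names]]
    groups = refine(groups, os.path.getsize)                # cheap: file size
    groups = refine(groups, lambda p: chunk(p, 0))          # cheap: first 1 KiB
    groups = refine(groups, lambda p: chunk(p, -1024))      # cheap: last 1 KiB
    groups = refine(groups, full_hash)                      # expensive: full hash
    for g in groups:
        print("hash-identical; confirm with a byte-for-byte compare:", g)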
posted by pompomtom at 5:29 AM on January 5, 2021


I agree, but I think you'd need a CLI tool to generate that data easily, and as the asker has made more than clear, command-line solutions are off the table, unless there's a GUI tool that can produce and export it. Hashing a bunch of folders of images may take a while, but it will give a correct answer.
posted by They sucked his brains out! at 1:31 PM on January 5, 2021


This thread is closed to new comments.