On a Mac: find duplicate music files in a folder full of subfolders
May 18, 2023 4:09 PM

I have 160 GB of music on an external drive. I know that nearly 50% of it is duplicate files (songs sorted multiple ways into multiple folders over multiple years) and I want to delete all duplicates. I tried a Terminal command string I found in a tutorial and I'm afraid it is going to run forever. Are there other (easier) options?

My old iTunes library is on an external drive, and the new Music app (I want iTunes back) won't let you have your library on an external drive. I have to copy all of my music to my laptop drive, which can't hold 160 GB of music on top of everything else, so I want to get rid of all of the dupes before any further pruning takes place.

I watched this video on YouTube and it led me to this page which allowed me to copy out the exact command string and paste it into Terminal. I haven't gotten the prompt back again, so I can only assume it's still running.

I tried to use Smart Folders in the Finder, but it would only let me search the computer, not an external drive. (If I'm missing something obvious here please do let me know.)

Any and all suggestions welcome at this point -- I'd prefer free options, but if there's paid software out there that finds duplicates, I'll pay if it's a good product.
posted by tzikeh to Computers & Internet (17 answers total) 6 users marked this as a favorite
 
HoudahSpot will find all instances of a particular name, e.g., files beginning with xxyy with the file extension mp4; duplicates and triplicates can then be deleted as needed.
posted by yclipse at 4:45 PM on May 18, 2023


This isn't a direct answer to your question, but if you want to watch it work, you can open another Terminal window and type the command tail -f /tmp/filelist.tmp.
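For example (the second command is just an optional extra that counts how many files have been checksummed so far):
tail -f /tmp/filelist.tmp
wc -l /tmp/filelist.tmp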
posted by panic at 5:41 PM on May 18, 2023 [1 favorite]


If you want to use a GUI tool instead of something on the command line, DupeGuru is free and works great. It's not super pretty but I can't imagine it taking more than a couple of minutes to run through 160 GB of MP3s on a slow external drive; not sure why your script got hung up.
posted by bcwinters at 5:50 PM on May 18, 2023 [1 favorite]


If you want to do it by hand, you can do it in the Finder.
  1. Open the folder containing all the music files
  2. Click on the little Search button in the upper right corner of the Finder window
  3. Type "Audio" (without the quotes)
  4. When the pop-up menu appears, choose "Kinds:Audio"
  5. Sort the files by size and/or by name and delete the duplicates
  Sorting by size would let you remove the largest files first. Finder windows also give you the option of grouping files by kind, which would let you separate MP3/AIFF/WAV etc. The search lets you flatten all the files out of the nested directory structure into a single list, so it at least makes it possible to do by hand. It would still take a while, though.
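If clicking through thousands of rows gets old, a rough command-line equivalent (a sketch: /Volumes/Music is a made-up name, and mdfind only works once Spotlight has indexed the drive, while plain find always does):
# substitute the real volume name for /Volumes/Music in both commands
mdfind -onlyin /Volumes/Music 'kMDItemContentTypeTree == "public.audio"'
find /Volumes/Music -type f \( -name '*.mp3' -o -name '*.m4a' -o -name '*.aiff' \)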

posted by Winnie the Proust at 6:09 PM on May 18, 2023 [2 favorites]


Response by poster: yclipse: HoudahSpot will find all instances of a particular name

That seems to be a good program for looking for a specific file or files, but not for listing thousands of duplicate files.

panic: if you want to watch it work

Hilariously, it finished up and failed at its one job: finding duplicates. Apparently, there are files that have the exact same checksum and the exact same size (the two things the command is using to find duplicates), but are not, in fact, the same song. Click to see screencaps of the results.

bcwinters: DupeGuru is free and works great... not sure why your script got hung up.

Unfortunately, DupeGuru gives me this error message when I try to double-click open it: "dupeguru can’t be opened because Apple cannot check it for malicious software. This software needs to be updated. Contact the developer for more information."

Winnie the Proust: If you want to do it by hand, you can do it in the Finder.

As I wrote in the post, I tried the Finder and it wouldn't let me choose the external drive to search. I opened the folder on the external drive where the music is, then clicked on the Search magnifying glass. No matter what I typed in the search field, it would only search my laptop, as if I had never chosen the external drive. I checked with Disk Utility and the external drive is not NTFS-formatted but exFAT, so it should be searchable.
posted by tzikeh at 8:12 PM on May 18, 2023


Can you possibly list all the files into a text file, then import that list into a spreadsheet and check that way? If you're on a Mac, aren't they running Unix? So a Unix command-line solution might work.

Something like:

1) ls -la (but with a recursion modifier - you can figure this out; see the sketch after this list) > completefilelist.txt

2) import the file completefilelist.txt into LibreOffice Calc (spreadsheet program), or an online spreadsheet if you prefer, using space or / as a delimiter. If you can only do one delimiter at a time, do it in two passes: import once, copy the still-stuck-together column into another spreadsheet, save as text, import that with the other delimiter, then copy/paste the result next to the original split columns.

3) sort the whole thing by the final part of the file path

4) find some sort of macro to delete any rows with a unique file name, or otherwise identify duplicates. Duplicated records are an old, old problem, so there's bound to be something out there to help you.
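A sketch of step 1 (assuming the music lives at /Volumes/Music -- substitute the real path); the second command skips the spreadsheet entirely and just prints the file names that occur more than once:
# /Volumes/Music is a placeholder for the real mount point
ls -laR /Volumes/Music > completefilelist.txt
find /Volumes/Music -type f | sed 's:.*/::' | sort | uniq -d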
posted by amtho at 8:40 PM on May 18, 2023


I use rdfind for this sort of task. You'd need to use MacPorts to install it, so that may be a barrier. It finds exact duplicates, but it's quickish because it starts by eliminating whole classes of obvious non-duplicates (by file size, first x bytes, last x bytes, etc.), then goes to checksums, and eventually, if necessary, actual file comparisons.

NB: When I use it, I set it to create hardlinks, so in the end you still have all the file handles, but just less redundant data (hence: rdfind).
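It goes roughly like this (a sketch; the path is a placeholder, and the dry run only writes a results.txt report without touching anything):
# dry run first: report only, changes nothing
rdfind -dryrun true /Volumes/Music
# then replace duplicates with hardlinks (or use -deleteduplicates true instead)
rdfind -makehardlinks true /Volumes/Music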
posted by pompomtom at 9:20 PM on May 18, 2023


(or you can just have it spit out a report telling you where your dupes are)
posted by pompomtom at 9:21 PM on May 18, 2023


If you want all the names, file sizes and the path to each file in a format that can be easily loaded into a spreadsheet:
find ~ -type f -name "*mp3" -printf "\"%f\",\"%s\",\"%p\" \n" > ~/tmp/filelist.csv
after which you can, as amtho already suggested, sort on just the name or the file size to see possible duplicates.

Leave out the -name "*mp3" entirely if there are other filename extensions in use, or rerun the command once per extension, using ... >> ~/tmp/filelist.csv on each subsequent run so that it all ends up in one csv file, or use
find ~ -type f \( -name "*mp3" -o -name "*flac" -o -name "*whatever" \) -printf ...
You can keep adding -o -name until you run up against the maximum command line length or you're out of extension types to add, whichever comes first.
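(If the Mac's stock find turns out not to know -printf -- it's a GNU find extension -- stat can stand in; a rough equivalent that emits the full path and the size as a quoted csv line per file:)
find ~ -type f -name "*mp3" -exec stat -f '"%N","%z"' {} + > ~/tmp/filelist.csv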

The csv import might trip on file names that contain double quotes; I haven't tested that.

The checksum method in the suggested command line is sound as such; the way they used it is not, because it ONLY uses file size and checksums to find duplicates, then takes the checksum/file size pairs that occur more than once and looks up the name(s) of the matching files in the list that was created. It doesn't try to match on file names in any way. And multimedia files will often contain metadata like "number of times played", which gives each copy a different checksum. File sizes may turn out to be a few kB different too that way, but once you have the list in a spreadsheet as suggested above you can probably apply some low-level wizardry to check whether files with similar names (simply ignoring case and whitespace, for instance) are less than a few kB different in size.
posted by Stoneshop at 10:58 PM on May 18, 2023


Best answer: Gemini 2?
posted by Grangousier at 11:24 PM on May 18, 2023 [2 favorites]


Long shot: if you're comfortable with the terminal, Camilla Berglund's duff (duplicate file finder) will list all but one (in excess mode) of clusters of duplicate files. As Apple Music creates its own folder structure from whatever you drop on it, it won't matter that the source files are kind of disorganized.
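Something like this, if I remember the flags right (a sketch; the path is a placeholder):
# -r recurses; this prints whole clusters of identical files
duff -r /Volumes/Music
# -e (excess mode) lists all but one file per cluster, i.e. the candidates for deletion
duff -re /Volumes/Music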
posted by scruss at 5:28 AM on May 19, 2023


Apparently, there are files that have the exact same checksum and the exact same size

Those all look like 4096-byte files with names starting with "._". Files like that are metadata files created by macOS to store icons and so on, so they're the same because they have the same icon. You see them a lot when you copy files from a Mac to Windows on an external drive.

I'm not sure why they showed up -- the "-size +1M" option in the find command should have filtered them out. But I think the command sorts by size, so all of the metadata files should be in one big group.
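You could also just keep them out of the listing by adding an exclusion to the find, e.g.:
find . -type f ! -name '._*' -size +1M -exec cksum {} \;
(macOS also ships a dot_clean utility that merges ._ files back into their parent files, for what it's worth.)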
posted by ectabo at 6:29 AM on May 19, 2023 [1 favorite]


[The Finder] wouldn't let me choose the external drive to search.

This answer to a question on Stack Exchange might provide a way to get Spotlight to index your external volume, which should then allow it to be searched.

If the Finder simply won't search exFAT volumes, you could copy the contents to a different external drive that is formatted using APFS. Not ideal, I know.
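If it comes down to Terminal, the usual incantation is something like this (a sketch; substitute the real volume name, and the first line only reports current indexing status):
# /Volumes/Music is a placeholder
mdutil -s /Volumes/Music
sudo mdutil -i on /Volumes/Music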
posted by Winnie the Proust at 7:00 AM on May 19, 2023


Unfortunately, DupeGuru gives me this error message when I try to double-click open it: "dupeguru can’t be opened because Apple cannot check it for malicious software. This software needs to be updated. Contact the developer for more information."

This happens with basically any unsigned app on macOS now. You can go into "Security & Privacy" in System Preferences and there should be an "Open Anyway" button you can click that will let you run it.
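Failing that, right-clicking (or control-clicking) the app and choosing Open usually offers the same override, or you can strip the quarantine flag in Terminal -- a sketch, assuming the app ended up in /Applications under that name:
# path is an assumption; point it at wherever the app actually lives
xattr -d com.apple.quarantine /Applications/dupeGuru.app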

(I've used DupeGuru myself in the past, it works pretty well. I used the Windows version though.)
posted by neckro23 at 8:38 AM on May 19, 2023


Best answer: Here's the command line you found to paste into Terminal:

find . -type f -size +1M -exec cksum {} \; | tee /tmp/filelist.tmp | cut -f 1,2 -d ' ' | sort | uniq -d | grep -hif - /tmp/filelist.tmp | sort -nrk2; rm /tmp/filelist.tmp

Many of the replies on the site where that was posted report failures at the grep -hif - /tmp/filelist.tmp step. Here's why those failures occur.

The command line as given consists of two pipelines separated by a semicolon. The first pipeline runs first, and the second runs only after the first has completed.

The second pipeline is just a single command (rm /tmp/filelist.tmp) that removes the /tmp/filelist.tmp file created by the first. You could leave it out entirely if you don't mind a little cruft hanging about in /tmp until your next reboot.

The first pipeline is more interesting. It runs all of the following commands in parallel, with the standard output stream of each feeding into the standard input stream of the next. The pipeline completes when its last command does.
find . -type f -size +1M -exec cksum {} \;
tee /tmp/filelist.tmp
cut -f 1,2 -d ' '
sort
uniq -d
grep -hif - /tmp/filelist.tmp
sort -nrk2
Because all of these commands are being invoked in parallel, the order in which they become ready to run is not guaranteed. Specifically, there is no guarantee that the command tee /tmp/filelist.tmp will have been running for long enough to create /tmp/filelist.tmp before the grep -hif - /tmp/filelist.tmp command attempts to open that same file in order to search it. Sometimes it will, sometimes it won't. When it doesn't, grep will complain about not being able to open the file it's being asked to search, and that's where the reported errors originate.

This kind of behaviour is called a race condition, and it's a super common coding mistake. Let's have a look at what the commands are actually doing, and see if we can improve things.

find . -type f -size +1M -exec cksum {} \; examines every entry in the current directory aka folder (denoted by .) and all its subdirectories. For each entry that describes a file rather than a nested subdirectory (-type f) where the size of that file is over a mebibyte (-size +1M) it invokes the cksum command (-exec cksum {} \;), substituting the pathname of the file it's just found in place of the {}. The \; marker tells find where the end of the cksum command line is.

Each invocation of cksum emits a line of text that consists of a CRC checksum, a file length and the file's pathname, separated by spaces. Note that the manual page I linked here says that the file length is given in octets (aka bytes), but its summary says "display file checksums and block counts" - and since your screenshot shows "4096" for so many of the second fields, I'm inclined to believe that the lengths it emits reflect space that the files occupy on the disk rather than being accurate application-level lengths.

Because cksum is invoked from inside find and find isn't producing any output of its own, the output stream of find will be all of the outputs from all of its invocations of cksum, one after the other. That whole output stream gets piped into tee /tmp/filelist.tmp which saves a copy into /tmp/filelist.tmp as well as copying it to its own output stream to feed along the pipeline.

The stream is then filtered through cut -f 1,2 -d ' ' which passes along only the first two fields from each line i.e. the checksum and the length, discarding the file names. The -d ' ' option tells cut to expect and write spaces as field delimiters rather than the tab characters it would use by default.

What is now just a big list of checksums and lengths then gets fed into the input of sort, invoked with no arguments. This does a couple of things: it delays all output until it's seen an end-of-file arrive on its input stream, then sorts all of that input, line by line, into alphabetical order before writing it all to its output stream. This is done so that entries derived from files with identical checksums and lengths will end up grouped on successive output lines.

Next step is filtering the now-sorted list of checksums and lengths through uniq -d. This emits one copy of each input line that's part of a group of identical ones and this is the actual duplicate-finding step; the assumption is that if a file has the same CRC checksum and the same length as some other file, then it's a duplicate of that other file regardless of how they're named or which folders they're in.

Now it's time to look up the duplicated checksums and lengths in the original file list so that the user can find out where they are. The output from uniq gets piped to the input of grep -hif - /tmp/filelist.tmp and this is where the trouble starts.

grep treats -hif as a composite option that means the same thing as -h -i -f. The -h option tells grep not to list the name of the file(s) in which it found the things it's been told to search for in the search results it emits. Since in this case we're only searching inside a single file (/tmp/filelist.tmp) that's reasonable. The -i option tells it to treat uppercase and lowercase letters as the same, which is weird here because the only things it's going to be searching for are strings of numeric digits. Finally, -f tells it to get a list of patterns to search for from a file, rather than having them specified directly on the command line as is more usual, and the filename attached to -f is - which is shorthand for the command's own standard input stream. The effect is to make grep use the entire list of checksum+length pairs that came down the pipe as patterns to search for inside /tmp/filelist.tmp.

It won't start performing the actual search until it's finished reading that list, which it can't do until all the commands upstream in the pipeline have completed and closed their outputs, which won't happen until after tee has written everything out to filelist.tmp that it was ever going to. So the search itself is sound if the race condition actually allowed grep to open filelist.tmp in the first place.

The final pipeline component is sort -nrk2, which sorts the output from grep numerically (-n) in descending order (-r) on the second whitespace-delimited key in each line (-k2), so you get to see the biggest duplicates listed first.

So, on to improvements: the first thing is getting rid of the race condition. This is most easily done by not trying to jam this whole thing into one huge pipeline, but instead generating the file list explicitly in its own step.
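For instance, something like this would do it -- the same commands as before, just run as two separate steps so that filelist.tmp exists in full before grep ever opens it:
find . -type f -size +1M -exec cksum {} \; >/tmp/filelist.tmp
cut -f 1,2 -d ' ' /tmp/filelist.tmp | sort | uniq -d | grep -hif - /tmp/filelist.tmp | sort -nrk2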

Next, I wouldn't use CRC32 checksums to identify duplicate content. CRC32 is unlikely to generate false positives, especially in conjunction with length, but I prefer cryptographic checksums for this kind of work because they are, for all practical purposes, never going to generate a false positive, and even on my shitty old laptop they're no slower to generate. So I'd use sha512sum instead.

Rather than sticking the resulting list in a temporary file and deleting it afterwards, I'd keep it around. Big lists of file checksums are handy for bit-rot testing, especially for content you're likely to want to back up across multiple external devices.

Therefore, the first thing is to make a big list of checksums and I'd do it like this:
find . -type f -exec sha512sum {} + | tee shasums.txt
You'll see the results scroll by in the Terminal window as they're calculated. If you run this after cd-ing into a folder containing tens to hundreds of gigabytes of files, especially if it's on an external drive and double especially if the external drive is USB-2 rather than USB-3, expect it to take a long time to run because it's reading every byte of every file.

Note that I'm using + rather than \; to terminate the command line for the -exec option here. What this does is make find perform fewer invocations of sha512sum; rather than one invocation per file, it does them in batches, with as many pathnames jammed into each sha512sum command line as the OS allows. The output will be the same, just marginally less slow.

If you only care about large files you can add the -size +1M option between -type f and -exec the same way the original command line does. And if you'd rather just have it chunter away quietly for a couple of hours without spamming your Terminal window with scrolling gibberish, just use
find . -type f -exec sha512sum {} + >shasums.txt
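(If sha512sum turns out not to be installed -- it comes with GNU coreutils rather than with macOS -- the stock shasum tool emits the same hash-then-filename lines and should drop straight in:)
find . -type f -exec shasum -a 512 {} + >shasums.txt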
If you expect all your duplicates to have the same filenames because they arose purely as a result of copying files to multiple folders, you can speed things up a lot by pre-filtering for that. Something like this should work (paste these lines into Terminal one at a time, waiting for each command to finish before pasting the next):
find . -type f -print | sort >pathnames.txt
sed 's:^.*/::' pathnames.txt >filenames.txt
sort filenames.txt | uniq -d >dupenames.txt
paste filenames.txt pathnames.txt | grep -Fhf dupenames.txt | sort | cut -f2 >dupepaths.txt
tr '\n' '\0' <dupepaths.txt | xargs -0 sha512sum >shasums.txt
You can sanity-check the text files made at each step (the ones whose names follow a >) before going on to the next by looking at them with Textedit or similar. All of these steps except the last should complete quickly because all they're doing is working with lists of filenames, not file contents.

Now that you have a list of checksums in shasums.txt, finding files with identical content is quick:
cut -d ' ' -f1 shasums.txt | sort | uniq -d >dupesums.txt
grep -Fhf dupesums.txt shasums.txt | sort >dupes.txt
The list of duplicated files is now in dupes.txt. Files prefixed with identical strings of hexadecimal gibberish have identical contents.

If my overall aim was to copy an entire file tree from an external drive to a space-limited internal one without burning space on duplicates, I wouldn't actually begin by deleting duplicates off the source drive because in my experience big deletion sessions always end in heartache. Instead, I'd create hard links on the destination, allowing one copy of the file to show up inside multiple folders. That's easily done once shasums.txt exists. Let me know if you'd rather go that way as well and I'll show you how. It can't be done on an exFAT filesystem because exFAT doesn't support hard links, so you can't use this technique to save space on your external drive. It will work just fine with the native Mac filesystem though.
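The basic move is just cp once and then ln (without -s) for every other folder the same song should appear in; a sketch with made-up paths, assuming the destination folders already exist:
# both source and destination paths here are placeholders
cp "/Volumes/Music/Albums/song.mp3" ~/Music/Albums/song.mp3
ln ~/Music/Albums/song.mp3 ~/Music/"By Year"/1999/song.mp3
Both directory entries then point at the same single copy of the data on the internal drive.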
posted by flabdablet at 9:16 AM on May 19, 2023 [10 favorites]


Response by poster: WOW, flabdablet - this is a great lesson in the elements of the command line and an answer to my problem all at once! Thanks for taking your answer so far beyond the question -- I LOVE it!
posted by tzikeh at 10:51 AM on May 19, 2023


OK, so let's whip up a script to do a de-duplicating copy of all the files in a shasums list.

First thing is establishing a workflow for getting shell scripts to run. Try pasting this whole block of text into Terminal all in one hit:
cat <<EOF >yow
#!/bin/bash
echo Yow!
EOF
You should now have a file in your current folder named yow. Check to see that it's there using the ls -l command; one of the lines you see should look something like
-rw-r--r-- 1 tzikeh  staff  31 May 20 04:16 yow
You might be able to open this with Textedit to see what's inside it; I don't have a Mac so I can't check whether Textedit will crack the sads because the filename has no extension. I don't think it will. In any case, we can see what's in there using a shell command: enter cat yow and the result should look like
#!/bin/bash
echo Yow!
Next step is to mark that file as executable so that the system lets us use it as a new command:
chmod +x yow
ls -l
and now the relevant line of output should look like
-rwxr-xr-x 1 tzikeh  staff  31 May 20 04:16 yow
Notice that the permissions string at the left hand side now has some new x characters in it. These are Execute permissions for the tzikeh user, the staff group, and everybody else respectively. So let's try actually running the thing: typing ./yow should get you
Yow!
Let me know if you hit any snags getting that to work. If all's well, next step is to make and run a copying script.
posted by flabdablet at 11:43 AM on May 19, 2023


This thread is closed to new comments.