Comparing folder contents using Terminal in MacOS
March 26, 2021 3:26 PM
In MacOS, I'm comparing the contents of folders on several external drives. In this particular situation I know identically-named files will be identical, so I only need to know how folders differ based on the names of the files inside them (which are inside lots of subfolders). I'm using the Terminal command diff -rq "pathname 1" "pathname 2". However, it's proceeding very slowly and seems to be looking for differences inside the files themselves. So is this really the command I should be using?
The reason I think it's trying to look for differences within files (like a deduplicating app might do) is because I see results like this:
Files /Volumes/pathname 1/.DS_Store and /Volumes/pathname 2/.DS_Store differ
Best answer: diff <(cd "pathname 1"; find . | sort) <(cd "pathname 2"; find . | sort)
Should do what you're after.
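To see how to read that output, here is a minimal sketch that builds two throwaway trees and runs the same name-only comparison. The helper names (list_names, compare_names) and the tree contents are made up for illustration, and it goes through a temp file instead of <(...) so that it should also run under plain sh.

```shell
#!/bin/sh
# Hypothetical helper: list every path under a directory, relative to it, sorted.
list_names() {
    (cd "$1" && find . | LC_ALL=C sort)
}

# Hypothetical helper: diff the relative name listings of two trees.
# Only names are compared; file contents are never read.
compare_names() {
    tmp=$(mktemp "${TMPDIR:-/tmp}/names.XXXXXX") || return 2
    list_names "$1" > "$tmp"
    list_names "$2" | diff "$tmp" -
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Demo on two throwaway trees (made-up contents).
demo=$(mktemp -d "${TMPDIR:-/tmp}/demo.XXXXXX")
mkdir -p "$demo/drive a/photos" "$demo/drive b/photos"
touch "$demo/drive a/photos/cat.jpg" "$demo/drive a/photos/dog.jpg"
touch "$demo/drive b/photos/cat.jpg" "$demo/drive b/photos/emu.jpg"

# Lines starting with "<" exist only in the first tree, ">" only in the second.
# diff exits with status 1 when the listings differ, hence the "|| true".
compare_names "$demo/drive a" "$demo/drive b" || true
rm -rf "$demo"
```

Names that contain newlines would confuse a line-based comparison like this one.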
posted by parm at 3:46 PM on March 26, 2021 [9 favorites]
Yeah, that's diffing the files. Maybe try outputting the file hierarchies to text files and diffing those files? e.g. with ls or find
find pathname1 > pathname1.txt
find pathname2 > pathname2.txt
diff pathname1.txt pathname2.txt
You might need to tweak that to strip out the leading directories. There are more sophisticated ways to do it, but this is rough and ready.
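One possible tweak, as a rough sketch: strip the root directory from every line before saving the listing, so paths from the two drives line up. The helper name relative_listing is made up, and the approach assumes no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helper: print every path under $1 with the leading "$1/" removed,
# sorted, so listings taken from two different roots can be diffed directly.
relative_listing() {
    root=${1%/}    # drop one trailing slash, if any
    find "$root" | ROOT="$root" awk '
        BEGIN { n = length(ENVIRON["ROOT"]) }
        # Skip the root itself; cut "root/" off everything else.
        length($0) > n { print substr($0, n + 2) }' | LC_ALL=C sort
}

# Usage, with the directory names from the commands above:
#   relative_listing pathname1 > pathname1.txt
#   relative_listing pathname2 > pathname2.txt
#   diff pathname1.txt pathname2.txt
```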
posted by holgate at 3:48 PM on March 26, 2021
One thing you could do is get a listing of each directory, and save to a file, and then diff the files.
find "pathname 1" -type f | sort > "pathname 1.txt"
That searches in pathname 1 for everything of type f (file), then sorts it alphabetically, then saves it to a file named "pathname 1.txt".
Do the same for "pathname 2", saving to "pathname 2.txt". Then:
diff "pathname 1.txt" "pathname 2.txt"
Using sort makes sure the filenames appear in the same order in both files; otherwise diff will report them as missing.
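A tiny sketch of why the sort step matters, using two made-up listings that hold the same names in different orders (the function name sort_demo and the file names are illustrative only):

```shell
#!/bin/sh
# Hypothetical demo: the same three names, listed in two different orders.
sort_demo() {
    work=$(mktemp -d "${TMPDIR:-/tmp}/sortdemo.XXXXXX") || return 1
    printf '%s\n' b.txt a.txt c.txt > "$work/list1.txt"
    printf '%s\n' a.txt c.txt b.txt > "$work/list2.txt"

    # Unsorted: diff reports differences although both lists name the same files.
    if diff "$work/list1.txt" "$work/list2.txt" > /dev/null; then
        echo "unsorted: no differences"
    else
        echo "unsorted: differences reported"
    fi

    # Sorted with the same collation on both sides: the spurious differences vanish.
    LC_ALL=C sort "$work/list1.txt" > "$work/sorted1.txt"
    LC_ALL=C sort "$work/list2.txt" > "$work/sorted2.txt"
    if diff "$work/sorted1.txt" "$work/sorted2.txt" > /dev/null; then
        echo "sorted: no differences"
    else
        echo "sorted: differences reported"
    fi
    rm -rf "$work"
}
sort_demo
```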
posted by JawnBigboote at 3:50 PM on March 26, 2021
Response by poster: phliar -- The contents of these drives are, for the most part, identical. But there are some files that are unique to each drive. So I'm hoping to see results that say something like, "Drive A has these files that don't exist on Drive B and Drive B has these files that don't exist on Drive A"
posted by theory at 3:51 PM on March 26, 2021
The fastest thing I can think of would be to use fd (NOTE: This is a link to fd-find) and comm to do a find. For example, I took a set of directories that were identical and added 5 files to each. Now you can do:
$ comm -13 <(fd -I -t f --base-directory GEOSgcm-1 | sort) <(fd -I -t f --base-directory GEOSgcm-2 | sort)
src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSphysics_GridComp/@GEOSchem_GridComp/@GOCART/foo3
src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSphysics_GridComp/GEOSmoist_GridComp/foo2
src/Components/@GEOSgcm_GridComp/GEOSdataatm_GridComp/foo1
src/Shared/@FMS/foo4
src/Shared/@MAPL/foo5
and that gives you the files unique to the second side (GEOSgcm-2, where I put the foo files). For the files unique to the first directory, use comm -23:
$ comm -23 <(fd -I -t f --base-directory GEOSgcm-1 | sort) <(fd -I -t f --base-directory GEOSgcm-2 | sort)
@cmake/@ecbuild/bar5
@cmake/bar2
src/Applications/@UMD_Etc/bar4
src/Components/@GEOSgcm_GridComp/GEOSogcm_GridComp/GEOSocean_GridComp/GuestOcean_GridComp/MOM6_GEOSPlug/@mom6/bar3
src/Components/@GEOSgcm_GridComp/bar1
These commands are pretty quick. For example, on my system a "diff -rq" of these directories takes around 1.5 seconds, while the fd commands might take milliseconds. I'm not great at hyperfine but:
❯ hyperfine -w 10 -i "diff -rq GEOSgcm-1 GEOSgcm-2" "comm -23 <(fd -I -t f --base-directory GEOSgcm-1 | sort) <(fd -I -t f --base-directory GEOSgcm-2 | sort)"
Benchmark #1: diff -rq GEOSgcm-1 GEOSgcm-2
Time (mean ± σ): 1.514 s ± 0.058 s [User: 237.0 ms, System: 1060.3 ms]
Range (min … max): 1.440 s … 1.588 s 10 runs
Warning: Ignoring non-zero exit code.
Benchmark #2: comm -23 <(fd -I -t f --base-directory GEOSgcm-1 | sort) <(fd -I -t f --base-directory GEOSgcm-2 | sort)
Time (mean ± σ): 0.2 ms ± 0.3 ms [User: 0.0 ms, System: 0.1 ms]
Range (min … max): 0.0 ms … 2.0 ms 852 runs
Warning: Command took less than 5 ms to complete. Results might be inaccurate.
Warning: Ignoring non-zero exit code.
Warning: The first benchmarking run for this command was significantly slower than the rest (0.0 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
Summary
'comm -23 <(fd -I -t f --base-directory GEOSgcm-1 | sort) <(fd -I -t f --base-directory GEOSgcm-2 | sort)' ran
8935.61 ± 16837.18 times faster than 'diff -rq GEOSgcm-1 GEOSgcm-2'
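I first thought the comm-fd command was just too fast for hyperfine to measure. :) But a 0.2 ms mean together with the "Ignoring non-zero exit code" warning more likely means the command failed instead of running: hyperfine runs commands through sh by default, and sh may not understand <(...). Pointing hyperfine at bash or zsh with its --shell option should give a real number, so don't trust the "8935 times faster" figure.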
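For anyone new to comm, here is a small sketch of what the -13 and -23 flags used above select. The two listings are made up and stand in for the sorted output of fd or find; the function name comm_demo is illustrative only.

```shell
#!/bin/sh
# Hypothetical demo of comm's column flags on two small, already sorted lists.
comm_demo() {
    work=$(mktemp -d "${TMPDIR:-/tmp}/commdemo.XXXXXX") || return 1
    printf '%s\n' bar1 common1 common2 > "$work/first.txt"
    printf '%s\n' common1 common2 foo1 > "$work/second.txt"

    # comm prints three columns: 1 = only in the first file,
    # 2 = only in the second file, 3 = in both.
    # Each digit after the dash suppresses that column.
    echo "only in first:"
    LC_ALL=C comm -23 "$work/first.txt" "$work/second.txt"
    echo "only in second:"
    LC_ALL=C comm -13 "$work/first.txt" "$work/second.txt"
    echo "in both:"
    LC_ALL=C comm -12 "$work/first.txt" "$work/second.txt"
    rm -rf "$work"
}
comm_demo
```

comm expects both inputs to be sorted the same way, which is why every listing in the commands above is piped through sort first.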
ETA: To fix html fun.
posted by Fortran at 4:07 PM on March 26, 2021
Uuuuuugggh. Okay. I am bad at html. I'll try and paste the comm-fd command one more time:
comm -23 <(fd -I -t f --base-directory GEOSgcm-1 | sort) <(fd -I -t f --base-directory GEOSgcm-2 | sort)
Sigh. I like Markdown... :)
posted by Fortran at 4:20 PM on March 26, 2021 [2 favorites]
rsync has a dry-run mode and is very good at working out the files that are different between a pair of trees.
posted by k3ninho at 12:45 AM on March 27, 2021 [2 favorites]
Best answer: comm is the tool for this. I tend to use temporary files instead of the fancy command line process thing '<(foo)' because with the temporary files you can run comm multiple times to get your different results. Why do the finding and sorting multiple times instead of once?
cd dir-1; find . -type f | sort > ../dir-1.lst; cd ..
cd dir-2; find . -type f | sort > ../dir-2.lst; cd ..
comm -23 dir-1.lst dir-2.lst # files unique to dir-1
comm -13 dir-1.lst dir-2.lst # files unique to dir-2
comm -12 dir-1.lst dir-2.lst # files common to both dir-1 and dir-2
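The same temp-file idea can be wrapped into one function that prints the "Drive A has these, Drive B has those" style report asked for earlier in the thread. This is only a sketch: the name report_unique is made up, it compares relative paths only, and it assumes no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helper: report which files (by relative path) exist in only one
# of two directory trees. Each tree is listed and sorted once; comm then reads
# the saved lists as often as needed.
report_unique() {
    dir_a=$1
    dir_b=$2
    lists=$(mktemp -d "${TMPDIR:-/tmp}/lists.XXXXXX") || return 1

    # LC_ALL=C keeps sort and comm in agreement about ordering.
    (cd "$dir_a" && find . -type f | LC_ALL=C sort) > "$lists/a.lst"
    (cd "$dir_b" && find . -type f | LC_ALL=C sort) > "$lists/b.lst"

    echo "Only in $dir_a:"
    LC_ALL=C comm -23 "$lists/a.lst" "$lists/b.lst"
    echo "Only in $dir_b:"
    LC_ALL=C comm -13 "$lists/a.lst" "$lists/b.lst"

    rm -rf "$lists"
}

# Example call (the volume names are placeholders):
#   report_unique "/Volumes/Drive A" "/Volumes/Drive B"
```

Adding ! -name .DS_Store to both find commands would keep the Finder metadata files out of the report.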
posted by zengargoyle at 10:40 AM on March 27, 2021 [1 favorite]
I don't fully follow what you need to do... to only look at the names of files I'd use something that gives me a list of all the files in the folder (including subfolders) in some canonical order and then compare those lists. Example:
ls -lR "pathname 1" > files1
ls -lR "pathname 2" > files2
Now you can compare files1 with files2. Does that make sense?
posted by phliar at 3:46 PM on March 26, 2021 [1 favorite]