Is there an AI (or other) program that can dedupe a set of files?
December 16, 2024 3:00 PM Subscribe
I have a large collection of video files located on a hard drive. Is there a program I can run that would scan this folder and suggest files that might be duplicates? Something that's like hey, this file seems similar to this one--either by scanning the file itself or even just the file names?
The files all have descriptive names but are not named consistently--for example I might have "Name date" and "Date name." I suspect I have many files that are functional duplicates--they might have different file sizes but are essentially the same file. Having a solution would save me a boat load of work.
video duplicate finder - haven't used it but it appears to do what you want.
posted by zippy at 4:19 PM on December 16, 2024 [1 favorite]
Also there is the previous AskMe on deduping files.
The underlying process is to "hash" each file, that is, associate it with a unique identifier generated from its contents. The ID depends only on the file's data, not its filename.
This is essentially the same process as described in blob's Stack Overflow-linked answer; however the AskMe I linked to has pointers to GUI tools that might make it easier to manage for a large set of files, or if you are not comfortable with command-line tools or scripting.
Once you have done this hashing procedure for all the files, you can look for files that have the same ID. Those are the duplicates.
This will only work on files that are byte-for-byte duplicates. It won't catch image or video files that use a lossy compression scheme and have been resaved or otherwise modified, since resaving changes the contents (and likely the file size). Files that are "essentially" the same but not identical will not be labeled as duplicates.
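As a minimal sketch of that hashing procedure in Python (my own illustration, not a specific tool from this thread), reading each file in chunks so large videos don't have to fit in memory:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hash(path, chunk_size=1 << 20):
    """Return a SHA-256 hex digest of a file's contents, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_exact_duplicates(folder):
    """Group files under `folder` by content hash; return groups of 2+ files."""
    by_hash = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            by_hash[file_hash(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Again, this only flags byte-for-byte identical files; two encodes of the same movie will hash differently.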
posted by They sucked his brains out! at 6:19 PM on December 16, 2024
Response by poster:
Thanks for the thoughts so far. The trouble with the hash approach, if I'm understanding correctly, is that it's based on the content of the file itself. Which means if I have two copies of Movie X but they're at different resolutions, I don't think they would get flagged. You and I can look at them and see they're the same movie, but the hash approach will see two totally different sets of data, right?
I think that's why I need something that can take a filename approach. Something that might see all of these as potentially suspect: "Lord of the Rings - Fellowship"; "LOTR - Fellowship"; "Fellowship of the Ring - LOTR"; etc., for example.
So far I've been using the voidtools everything app to run manual searches, but it's impractical to do that for thousands of files.
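A sketch of that filename approach (my own illustration, using only Python's standard library): normalize each name (lowercase, drop punctuation and the extension, sort the words so "Name date" and "Date name" collapse to the same string), then flag pairs whose normalized forms look alike:

```python
import difflib
import re

def normalize(name):
    """Lowercase, strip the extension and punctuation, and sort the words,
    so 'Name date' and 'Date name' normalize to the same string."""
    name = re.sub(r"\.[a-z0-9]{2,4}$", "", name.lower())  # drop extension
    words = re.findall(r"[a-z0-9]+", name)
    return " ".join(sorted(words))

def similar_names(filenames, threshold=0.6):
    """Return (name_a, name_b, similarity) for pairs of suspect filenames."""
    pairs = []
    for i, a in enumerate(filenames):
        for b in filenames[i + 1:]:
            ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if ratio >= threshold:
                pairs.append((a, b, round(ratio, 2)))
    return pairs
```

This won't know that "LOTR" and "Lord of the Rings" are the same title (an LLM or a lookup table would help there), but it catches reordered and re-punctuated names cheaply.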
posted by soonertbone at 6:47 PM on December 16, 2024
One possibility is video fingerprinting, which is often used for copyright enforcement on video files that have been recompressed but are visually similar.
There's a Python package called videofingerprint that might be of interest. This reduces videos to a character string, based on color and luminance, which would be resilient to codec or compression differences. You probably couldn't do an exact string-to-string identity matchup for larger files, but instead you might use a distance function between pairs of strings to find probable dupes, where pair distance is minimized.
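A sketch of that distance step (assuming each video has already been reduced to a fingerprint string; the `videofingerprint` API itself isn't shown here), using plain edit distance to surface the closest pairs:

```python
def levenshtein(a, b):
    """Edit distance between two fingerprint strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def probable_dupes(fingerprints, max_distance=5):
    """fingerprints: {path: fingerprint_string}. Return (path_a, path_b, distance)
    for pairs whose fingerprints are within max_distance edits."""
    items = list(fingerprints.items())
    pairs = []
    for i, (pa, fa) in enumerate(items):
        for pb, fb in items[i + 1:]:
            d = levenshtein(fa, fb)
            if d <= max_distance:
                pairs.append((pa, pb, d))
    return pairs
```

Sorting the results by distance gives you the most probable dupes first.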
posted by They sucked his brains out! at 7:08 PM on December 16, 2024 [1 favorite]
You could reduce the number of string comparisons by putting fingerprints into sets based on video metadata, such as narrow runtime ranges, which are less likely to change much even if the file's video stream is recompressed. You probably won't get (m)any duplicates from video streams of vastly different runtimes, so there is likely no need to compare those pairs.
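A sketch of that bucketing idea (assuming you already have each file's runtime in seconds, e.g. from its metadata):

```python
from collections import defaultdict

def bucket_by_runtime(videos, tolerance=2.0):
    """videos: list of (path, runtime_seconds). Group paths into buckets of
    near-equal runtime so only those get pairwise fingerprint comparisons."""
    buckets = defaultdict(list)
    for path, runtime in videos:
        key = round(runtime / tolerance)
        # A video near a bucket boundary could land on either side,
        # so register it in the neighboring buckets too.
        for k in (key - 1, key, key + 1):
            buckets[k].append(path)
    return buckets
```

With, say, a 90-minute movie and a 120-minute movie, the pair never gets compared at all, which cuts the quadratic comparison cost substantially.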
posted by They sucked his brains out! at 7:16 PM on December 16, 2024
An LLM might be able to suss this out for you. I tried ChatGPT with some synthetic file paths:
Prompt:

here's some file paths for movies. I want you to associate them with a standardized title (eg "Star Wars (1977)"). list each file and their associated title, separated by commas, in a code block
Result:
Media/Saga/Star_Wars_1977_IV_A_New_Hope[HD].mkv, Star Wars (1977)
Archive/Nolan/INCEPTION_2010_HDrip.m4v, Inception (2010)
Movies/Superhero Movies/Batman-TDK.mpg, The Dark Knight (2008)
Movies/Blockbusters/Jurassic_Park_DVDRip.mpg, Jurassic Park (1993)
Movies/Matrix/Matrix.avi, The Matrix (1999)
Movies/Drama/Gladiator_2000_FullHD.mpg, Gladiator (2000)
Movies/Blockbusters/Avatar_Full_HD.mpg, Avatar (2009)
Media/James Cameron/Titanic_1997[HD].mkv, Titanic (1997)
Movies/Romance/Titanic.avi, Titanic (1997)
Archive/Christopher Nolan/DarkKnight_2008_HDR.m4v, The Dark Knight (2008)
Movies/Crime/Quentin_Tarantino_Pulp_Fiction.mpg, Pulp Fiction (1994)
Downloads/Films/Pandora/Avatar_1080p.mp4, Avatar (2009)
Movies/Batman/TDK_2008.avi, The Dark Knight (2008)
Downloads/Films/Fantasy/LOTR - Fellowship of the Ring [1080p].mp4, The Lord of the Rings: The Fellowship of the Ring (2001)
Media/RidleyScott/Gladiator_2000[HD].mkv, Gladiator (2000)
Downloads/Films/DC/TheDarkKnight_1080p.mp4, The Dark Knight (2008)
Movies/James Cameron/Avatar_2009.avi, Avatar (2009)
Media/SciFi/Matrix_1999[BRRip].mkv, The Matrix (1999)
Downloads/Films/Mindbenders/Inception_1080p.mp4, Inception (2010)
Downloads/Films/Ocean/Titanic_HDRip_720p.mp4, Titanic (1997)
Movies/Tarantino/Pulp_Fiction.avi, Pulp Fiction (1994)
Movies/Sci-Fi Classics/StarWars_EpIV.mpg, Star Wars (1977)
Media/Dinosaur_Movies/Jurassic_Park[1993].mkv, Jurassic Park (1993)
Downloads/Films/JP/JurassicPark-1080p.mp4, Jurassic Park (1993)
Movies/Blockbusters/Titanic_Full_1080p.mpg, Titanic (1997)
Archive/Tarantino/PulpFiction_HDRip_94.m4v, Pulp Fiction (1994)
Downloads/Films/Lucas/StarWars_EP4_720p.mp4, Star Wars (1977)
Archive/Peter Jackson/LOTR/Fellowship_of_Ring-2001.m4v, The Lord of the Rings: The Fellowship of the Ring (2001)
Archive/Star Wars Original Trilogy/ANH_StarWars_1977.m4v, Star Wars (1977)
Movies/Historical/Gladiator.avi, Gladiator (2000)
Archive/Historical/Gladiator_DVDRip.m4v, Gladiator (2000)
Archive/Sci-Fi Classics/Matrix_4K.mpg, The Matrix (1999)
Movies/LOTR/Fellowship/LOTR_Fellowship.avi, The Lord of the Rings: The Fellowship of the Ring (2001)
Media/Movies/LOTR_Collection/The_Fellowship_of_the_Ring.mkv, The Lord of the Rings: The Fellowship of the Ring (2001)
Media/3D Movies/Avatar_HDRip.mkv, Avatar (2009)
Archive/Sci-Fi/Spielberg/JURASSICPARK_93_HDRip.m4v, Jurassic Park (1993)
Movies/Top Movies/Inception[Director's Cut].mpg, Inception (2010)
Media/Nolan/Batman_TDK[BRRip].mkv, The Dark Knight (2008)
Movies/Keanu Reeves/Matrix1999_REMASTERED.m4v, The Matrix (1999)
Downloads/Films/Colosseum/GLADIATOR_HDRip_1080p.mp4, Gladiator (2000)
Movies/HD Movies/LOTR1-Fellowship_DVDrip.mpg, The Lord of the Rings: The Fellowship of the Ring (2001)
Archive/Drama/Titanic_JC_1997.m4v, Titanic (1997)
Archive/Sci-Fi/Avatar_2009_HDR.m4v, Avatar (2009)
Downloads/Films/Neo/The_Matrix_HDRip.mp4, The Matrix (1999)
Movies/Speilberg/Jurassic_Park.avi, Jurassic Park (1993)
Media/Classics/PulpFiction_1994[HD].mkv, Pulp Fiction (1994)
Downloads/Films/Cult/Pulp_Fiction_720p.mp4, Pulp Fiction (1994)
Media/Christopher Nolan/Inception_BRRip_2010.mkv, Inception (2010)
Movies/StarWars/StarWars_IV_NewHope.avi, Star Wars (1977)
Movies/Nolan/Inception.avi, Inception (2010)
It might not do so well on a large amount of data or on files that aren't named after things it's already knowledgeable about (in this case, popular movies).
Another approach is to literally throw a file listing into ChatGPT and just ask it to find probable sets of duplicates. If there are too many for ChatGPT, Google Gemini supports fairly enormous inputs. It's liable to get things wrong, but it could still speed you up.
posted by BungaDunga at 7:47 PM on December 16, 2024
Re: the LLM, you could ask it to output a pairwise probability that each file name is a match, then you can start with the most likely matches.
posted by MisantropicPainforest at 8:12 PM on December 16, 2024
ffprobe is part of the ffmpeg suite (because all multimedia questions have an answer like "you should use ffmpeg"), and it can supply metadata about resolution, video codec, audio channels, and subtitles as an alternative to the exact-file-hashing approach.
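For example, a small Python wrapper around ffprobe (this assumes ffprobe is installed and on your PATH; `-show_format` and `-show_streams` dump container and per-stream metadata as JSON):

```python
import json
import subprocess

def ffprobe_cmd(path):
    """Build the ffprobe invocation that dumps a file's container and
    stream metadata (duration, resolution, codecs) as JSON."""
    return ["ffprobe", "-v", "error", "-print_format", "json",
            "-show_format", "-show_streams", path]

def video_metadata(path):
    """Run ffprobe on a file and return the parsed metadata dict."""
    out = subprocess.run(ffprobe_cmd(path), capture_output=True,
                         check=True, text=True).stdout
    return json.loads(out)
```

The returned dict's `format` section includes the duration, which is exactly what the runtime-bucketing suggestion above needs.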
posted by k3ninho at 2:48 AM on December 17, 2024
For the specific task of de-duplicating videos that are not the same file but the same content, I'd use Plex. It's a video library management program and is very good at figuring out a canonical name for a random video file. It has tools for de-duplicating your library.
For the general task of identifying bit-identical dupes I use fdupes.
posted by Nelson at 10:33 AM on December 17, 2024
This approach is significantly faster than just getting a list of files and running the expensive md5 calculation on each. (Python code is included in the answer.)
posted by blob at 4:13 PM on December 16, 2024 [3 favorites]
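The answer's own code isn't reproduced here, but the speed-up it describes can be sketched like this (my own illustration): group files by size first, which is a cheap stat call, and only hash files that share a size with at least one other file:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def duplicates_by_size_then_hash(folder):
    """Group files by size first, then hash only the files whose size
    matches another file's; return groups of byte-identical files."""
    by_size = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    dupes = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size can't have a content duplicate
        for path in paths:
            # read_bytes() is fine for a sketch; chunk the read for huge files
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            dupes[(size, digest)].append(path)
    return [group for group in dupes.values() if len(group) > 1]
```

On a library where most files are unique, this hashes only a small fraction of the data.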