Open-source or trial de-duplication tools anyone can recommend?
January 9, 2009 12:46 PM

I have several old data backup devices, with some overlap on what's on them. I'd like to consolidate them all onto a new device, but without having 3 copies of the overlapping redundant data. I think the term I'm looking for is de-duplication. Any open source or free trial tools that MeFites can recommend?

I'm looking at several old SNAP! Server devices that were used for data backups. The information on all three is valuable, but there is some overlap of what's contained on them. And what's on there is so jumbled and disorganized that there's no way to sort it all out manually without an unreasonable amount of effort.

The hard drives on these things are basically at full capacity, so a new backup device is needed. Rather than keep 4 devices networked to access all this information (and continue having to search 3 devices to find a file), I'd like to consolidate all that data onto a new device as one big set of files.

So I'm looking for some tool that will let me get a new device (with a larger capacity), point some software at the three old devices, and tell it to move the contents of the three old devices onto the one new device, BUT NOT to copy multiples/duplicates. This should empty out the older devices so they can be wiped and disposed of, and give me one new device with all of their contents, but eliminating the need to store dozens of GB of overlapping files they all used to contain.

Anything anyone can recommend?
posted by penciltopper to Computers & Internet (2 answers total)
 
Given that:

a) the information on all 3 disks is valuable
b) what's on there is jumbled and disorganized
c) duplicates are in the dozens of GB

and that "disks are cheap, but data may be priceless," you would be best served by making a full backup (hell, make several) of all 3 devices and only then doing de-duplication on another copy of the data. Measure twice, rm once.

For de-duping, use a tool that computes checksums over all the files, and then either deletes dupes or hard-links / junctions them (depending on your specific situation -- OS, filesystem, and whether you care about preserving structure or simply want to save space).
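
As a rough illustration of the checksum approach, here is a minimal Python sketch. It is not how trimtrees.pl or any particular tool works; the function names (file_digest, find_duplicates) and the choice of SHA-256 are my own assumptions. It only reports groups of identical files and never deletes or links anything, so it is safe to run against a copy of the data first.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots):
    """Group files under the given root directories by identical content.

    Returns a list of lists; each inner list holds paths whose contents match.
    """
    # Pass 1: bucket by size, since files of different sizes cannot match.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

You would call it with the mount points of the old devices, something like find_duplicates(["/mnt/snapA", "/mnt/snapB", "/mnt/snapC"]) (those paths are made up), and review the groups before deciding what to delete or hard-link. Bucketing by size first avoids hashing files that cannot possibly have a twin.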

Check out trimtrees.pl or Google for "duplicate file finder".
posted by lascimmia at 1:35 PM on January 9, 2009


You'll be able to do this with rsync (a Unix/Linux command-line tool). On Windows, you can either install Cygwin to run the command or follow another howto.

Basically, it will look something like this:
  1. rsync -a driveA/backup/ bigDrive/backup/
  2. rsync -a driveB/backup/ bigDrive/backup/
  3. rsync -a driveC/backup/ bigDrive/backup/
You can tell rsync to compare file checksums (-c / --checksum) or modification dates (-u / --update, which skips files that are newer on the destination), so if driveA has newer stuff on it than driveC, your driveC copy won't obliterate the newer stuff.
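
If rsync isn't an option, the "don't clobber newer files" merge can be sketched in Python. This is only a rough approximation of the date-based behavior described above, not a reimplementation of rsync; merge_tree is a hypothetical helper name, and it compares modification times only. Note that, like the rsync commands, it merges files that share the same relative path, so identical files stored under different names or folders will still be copied more than once.

```python
import os
import shutil


def merge_tree(src, dst):
    """Copy src into dst, skipping files that are already as new or newer in dst.

    Returns the list of destination paths that were written.
    """
    written = []
    for dirpath, _dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            s = os.path.join(dirpath, name)
            d = os.path.join(target_dir, name)
            # Keep the destination copy when it is at least as new as the source.
            if os.path.exists(d) and os.path.getmtime(d) >= os.path.getmtime(s):
                continue
            shutil.copy2(s, d)  # copy2 also carries over the modification time
            written.append(d)
    return written
```

Calling merge_tree once per old drive, with the same destination each time, mirrors the three rsync steps; the returned list shows what each pass actually added or replaced.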
posted by yellowbkpk at 1:52 PM on January 9, 2009 [1 favorite]

