not every snowflake is special
February 11, 2008 11:23 AM
Location & name independent rsync?
I have machine A, with lots and lots of empty space.
I have machines B, C, D, E, F, ... with lots of files, some of which are duplicated between machines, though often not under the same file structure or naming convention, e.g.:
Machine B:
/home/userX/media/Wedding01.avi
Machine C:
/var/local/storage/movies/sis-wedding.avi
Ideally, I'd like an agent I can put on machines B-F, point at specific folder(s), and give an FTP account on machine A, and let it auto-magically figure out the minimum set of files that need to be backed up to have a complete set of the files from all the original machines. It does NOT have to remember where the files were on each machine.
Machines B-F are currently Windows boxen that expose their data via Windows shares, so I can point a Unix client at them, if needed.
Machine A can be a Unix, Windows, or Mac OS X box, whatever is needed.
Do you want to keep the directory structure from each machine, like:
/backups/machineA/path/to/files
/backups/machineB/path1/to/files
/backups/machineB/path2/to/files
and so on?
Or do you just want a flat directory, like:
/backups/file-machineA-1.avi
/backups/file-machineA-2.avi
/backups/file-machineB-1.avi
etc...?
posted by rhizome at 11:53 AM on February 11, 2008
Probably the flat-dir, but really, I don't care about the paths or names of the backed-up files, only the duplication factor. In general, I'd prefer to end up with only a single filename for each file; where the file exists under two names with the same data, that's my failing in filing, not an indication of important informational differences between the two files.
posted by nomisxid at 12:05 PM on February 11, 2008
If you have multiple filenames for the same content, you're going to need to hash the files and compare them to each other to figure out whether they're the same, unless you have some kind of list. I guess you could also compare file sizes, but that'd be less reliable. At any rate, this would be an extra step.
Generally, you could do this with find and rsync, such that (in unix/bash):
#!/bin/bash
# process substitution below requires bash, not plain sh
while read -r file; do
    # trailing slash in destination is important
    rsync "$file" host:/backup/path/
done < <(find /source/path -type f)
posted by rhizome at 12:15 PM on February 11, 2008
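A minimal sketch of the hashing step rhizome describes above, assuming Python on the client (the thread later suggests Python or Perl): group files by size first, which is cheap, and only hash files whose sizes collide. The /source/path root is illustrative, not from the thread.
#!/usr/bin/env python
# Find files with identical content under a tree.  File size is a cheap
# first-pass filter; SHA-1 confirms the real matches.
import hashlib
import os
from collections import defaultdict

def file_sha1(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024 * 1024), b''):
            h.update(block)
    return h.hexdigest()

by_size = defaultdict(list)
for dirpath, dirnames, filenames in os.walk('/source/path'):
    for name in filenames:
        path = os.path.join(dirpath, name)
        by_size[os.path.getsize(path)].append(path)

by_hash = defaultdict(list)
for size, paths in by_size.items():
    if len(paths) < 2:
        continue                    # unique size, cannot be a duplicate
    for path in paths:
        by_hash[file_sha1(path)].append(path)

for digest, paths in by_hash.items():
    if len(paths) > 1:
        print('%s: %s' % (digest, ', '.join(paths)))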
An idea:
- Make a hash (SHA-1 or such) of each file on the machines (B-F).
- Fetch a list of hashes from machine A, compare, and upload the files with no matching hashes.
- Append the hash to the filename while uploading, to avoid collisions between similarly named but different files; that way they can all be kept in a single location.
Something like this should be easy to make with Python or Perl, which are available on all of these platforms.
posted by phax at 12:15 PM on February 11, 2008
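A rough client-side sketch of phax's scheme, assuming machine A runs a plain FTP server and keeps a hashes.txt file listing the digests it already holds; the host name, credentials, source folder, and that file name are all made up for illustration.
#!/usr/bin/env python
# Sketch of phax's idea: hash each local file, skip anything machine A
# already has, and upload the rest with the hash appended to the filename.
# HOST, USER, PASSWD, SOURCE_DIR and the hashes.txt listing are assumptions.
import ftplib
import hashlib
import os

HOST, USER, PASSWD = 'machine-a.example', 'backup', 'secret'
SOURCE_DIR = r'C:\media'            # folder to back up on this client

def file_sha1(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024 * 1024), b''):
            h.update(block)
    return h.hexdigest()

ftp = ftplib.FTP(HOST, USER, PASSWD)

# Hashes machine A already holds, one hex digest per line.
known = set()
ftp.retrlines('RETR hashes.txt', known.add)

for dirpath, dirnames, filenames in os.walk(SOURCE_DIR):
    for name in filenames:
        path = os.path.join(dirpath, name)
        digest = file_sha1(path)
        if digest in known:
            continue                # some machine already backed this up
        base, ext = os.path.splitext(name)
        remote = '%s-%s%s' % (base, digest, ext)   # e.g. Wedding01-<hash>.avi
        with open(path, 'rb') as f:
            ftp.storbinary('STOR ' + remote, f)
        known.add(digest)           # don't re-upload local duplicates either
        # A real agent would also append the new digest to hashes.txt on A.

ftp.quit()
Appending the hash to the uploaded name means two different files that happen to share a name never collide on machine A, and identical files found on several machines are uploaded only once.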
Via Halfbakery, I found Automated Snapshot Style Backups with Rsync, which gives you a really quick way to back up to your server.
To conserve space on the server, there are utilities that combine duplicate files, replacing the copies with hard links to the "original." Hardlink appears to be one such utility.
This isn't always absolutely perfect for network usage, but it's pretty good, & you won't have to write your own software for this.
posted by Pronoiac at 2:50 AM on February 12, 2008
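In outline, a hardlink-style deduplicator walks the backup tree, hashes everything, and when two paths turn out to hold identical content, re-points one of them at the other with a hard link. A simplified sketch assuming Python and an illustrative /backups root (hard links only work within a single filesystem):
#!/usr/bin/env python
# Collapse duplicate files under one tree into hard links.
import hashlib
import os

def file_sha1(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024 * 1024), b''):
            h.update(block)
    return h.hexdigest()

seen = {}                           # content hash -> canonical path
for dirpath, dirnames, filenames in os.walk('/backups'):
    for name in filenames:
        path = os.path.join(dirpath, name)
        digest = file_sha1(path)
        if digest in seen:
            os.remove(path)               # drop the duplicate copy...
            os.link(seen[digest], path)   # ...and hard link it to the original
        else:
            seen[digest] = path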
Gah. I looked into a problem like this a few years ago, goofing around with Perl with a friend, & this kind of jogged that memory, so now I'm sharing the joy.
The rsync snapshot system in my last comment transfers files the server might already have: when a client first syncs, or when files have been renamed or moved.
How do you deal with those? If you're on a local network, just transfer the files. It's a lot simpler. The payoff of the more complex approach is dubious, due to having to confirm matches, scan over the disk a couple of times with accompanying CPU usage, etc.
If, though, you're on dialup, or potentially shaving days or more off of a transfer over broadband, then, um, write your own software. The absolute minimum metadata transfer I worked out was: run aide or tripwire on your client to generate a checksum database (optionally, diff that against a local copy of the last checksum database the server has), & transfer that db or diff. On the server, put up the last snapshot, copy or hard link in any files you recognize in the diff, then rsync everything.
That takes care of the first sync, & it catches renamed & moved files, which are re-transferred in the snapshot system in my previous comment.
Oh, man, I remembered one step further, for incomplete files that got renamed & changed, but I'm nerded out enough for now.
posted by Pronoiac at 3:38 AM on February 12, 2008
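A rough sketch of the server-side "copy or hard link in any files you recognize" step described above, assuming the client has sent a manifest of "digest relative/path" lines and the previous snapshot sits next to the one being built; the directory names and manifest format here are assumptions, not the output of aide or tripwire.
#!/usr/bin/env python
# Before rsyncing, hard link content already present in the last snapshot
# into the new snapshot under its new names, so renamed or moved files and
# first-time clients don't trigger a re-transfer.  Paths and the manifest
# format (hex digest, whitespace, relative path) are assumptions.
import os

PREV = '/backups/snapshot.1'        # last completed snapshot
NEW = '/backups/snapshot.0'         # snapshot being built

def read_manifest(path):
    index = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            digest, relpath = line.split(None, 1)
            index.setdefault(digest, relpath.strip())
    return index

prev_index = read_manifest(os.path.join(PREV, 'manifest.txt'))
new_manifest = read_manifest('/tmp/client-manifest.txt')

for digest, relpath in new_manifest.items():
    if digest not in prev_index:
        continue                    # genuinely new content; rsync moves it
    src = os.path.join(PREV, prev_index[digest])
    dst = os.path.join(NEW, relpath)
    destdir = os.path.dirname(dst)
    if not os.path.isdir(destdir):
        os.makedirs(destdir)
    if not os.path.exists(dst):
        os.link(src, dst)           # recognized file lands in place for free
After this, the follow-up rsync from the client only has to send content the server has never seen.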
Does anybody else ever get annoyed at their cryptic notes to themselves? So I don't get annoyed at myself in a few years:
Scanning lots of data for a match of an arbitrary block is doable with the sliding windows of par2. If writing new software, client-side rsync or making a list of blocks with hashes might be interesting, though the latter is more for a filesystem that smooshes the unchanged blocks between two versions of a file together. (Did plan9 do that? I forget.)
The rsync snapshot item that got me a best answer above (thanks!) got packaged into rsnapshot.
posted by Pronoiac at 4:25 PM on March 18, 2008
This thread is closed to new comments.