What features should I add to my basic backup solution?
February 16, 2008 4:02 PM   Subscribe

I've rolled my own backup solution, and I want to know if I could/should be doing anything more. What I'm basically doing now is copying over everything every 10 days, and deleting anything that's over 4 backups old.

Basically, here's some of the files in my directory for the laptop:

080110.bookmarks.html
080110.cyg-home.tar
080110.desktop.tar
080110.music/
080110.myjunk.tar
080110.personal.tc
080119.bookmarks.html
080119.cyg-home.tar
080119.desktop.tar
080119.music/

The date of the backup is part of the name. The music directories are about 15G each; the others are less than 1MB. I'm using Bash shell scripts, ssh, and scp. (I couldn't get rsync to be much faster than scp. I was using Samba, but that just kept spinning up my drive.) This is all done on my local wireless network.

I've already got the best feature: it's automatic. Should I be doing monthly/weekly/daily backups? What would that entail, and how do I do it? I've got the frequency set at 10 days, but more frequent would be nice. My problem is that more frequent backups create more files, so I end up deleting backups that aren't very old yet (my space is limited).

Please give me your recommendations.
posted by philomathoholic to Computers & Internet (24 answers total) 6 users marked this as a favorite
 
Response by poster: I just realized I should be doing some sort of compression on some of the files.
posted by philomathoholic at 4:05 PM on February 16, 2008


Best answer: I'd suggest using rsync instead. You're missing out on a nice feature that it provides, namely the ability to copy only files that have more recent modification dates on the source machine than on the destination... so you copy only what changes, instead of everything every time.

Of course, you'd have to give up tarballs for that... But assuming most of your data doesn't change every week, you would save a lot of bandwidth, which would make daily backups much more practical. Rsync can also use ssh as a transport (the -e flag), so it runs at essentially the same speed as scp with the same authentication and encryption. The rsync line I use for backups: rsync -avz -e ssh source dest

Compressing the tarballs will save some space (with tar, -z for .gz or -j for .bz2), but it won't do anything but waste time for a music collection or other media that's already in a compressed format.
posted by qxntpqbbbqxl at 4:24 PM on February 16, 2008


I am sure you will come up with something snazzier if you look into combining rsync with LVM snapshots. If you create a snapshot of your backup volume immediately before writing today's backups to it (an only-copy-changed-or-missing-files strategy, rather than tarballs, would be best), then you get a reach-back-in-time ability without having to bother about deleting old backups; the snapshots will just delete themselves when they can no longer represent the differences between their own state and that of the underlying volume. Snapshots that you allocate more space to will survive longer.
posted by flabdablet at 4:39 PM on February 16, 2008


Response by poster: qxntpqbbbqxl: So if I went with rsync, how would you suggest I set up the file structure? Something like a month-old baseline with daily directories of changed files? I like having multiple recovery points to choose from, so I don't want just one daily-backed-up file for each source file. I assume that recovery would consist of first copying over the baseline files, then copying over whichever set of changed files I chose?

I won't attempt to compress my encrypted files or my music collection.
posted by philomathoholic at 4:42 PM on February 16, 2008


You can keep multiple recovery points with rsync. Let's say last week you had backup.20080207 on your remote server. This week cp -pr it to backup.20080217, and then run rsync to that directory. You'll keep multiple copies on the server without having to copy everything down the wire twice.
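A sketch of that rotation with local directories standing in for the server (dates and paths are made up):

```shell
# Last week's backup already exists on the server...
mkdir -p /tmp/rot/backup.20080207
echo "v1" > /tmp/rot/backup.20080207/notes.txt
# ...so seed this week's directory from it, preserving timestamps,
cp -pr /tmp/rot/backup.20080207 /tmp/rot/backup.20080217
# then rsync the live data over the new copy. Here we just simulate
# one changed file arriving from the rsync run:
echo "v2" > /tmp/rot/backup.20080217/notes.txt
```

Both recovery points now exist independently; with a real rsync run, only the changed files would actually cross the wire.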
posted by grouse at 5:01 PM on February 16, 2008


Best answer: I'd recommend signing up for S3 or another bulk file storage service and periodically uploading snapshots so that you have some physical isolation for your data in the event that your home/office/compound falls into a hellmouth or something. I just signed up for Mozy and it's going to take weeks to do the initial backup, but I feel a bit more comfortable that my data will be hosted somewhere the cat can't reach it. (There's no reason you couldn't have your roll-your-own solution backed up offsite.)
posted by socratic at 5:23 PM on February 16, 2008


Best answer: Use rsnapshot. For the philosophy behind it (and any decent rsync-based backup solution), read this article.
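For reference, an rsnapshot configuration is a short tab-separated file; a fragment along the lines the article describes might look like this (paths and host are made up, and fields must be separated by tabs, not spaces):

```
# rsnapshot.conf fragment -- fields MUST be tab-separated.
snapshot_root	/backups/snapshots/
interval	daily	7
interval	weekly	4
interval	monthly	6
backup	user@laptop:/home/user/	laptop/
```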
posted by SemiSophos at 5:42 PM on February 16, 2008


re: multiple recovery points
  What grouse said, or the snapshots that others have been talking about


Some other things that could be handy, depending on how you use the files: Unison and Subversion
posted by qxntpqbbbqxl at 5:58 PM on February 16, 2008


Response by poster: SemiSophos: That's an informative article, thanks. I'll probably put in the time to get something like that implemented (specifically, the part about hard links). It seems like that might work out really well.

socratic: I'd sign up with S3 if I had the money. Thanks for the reminder though.
posted by philomathoholic at 6:25 PM on February 16, 2008


Response by poster: qxntpqbbbqxl: I thought you meant I'd have to structure my backups differently, but I think rsync will do the delta-copy with individual files too (i.e., tarballs).

I'll most likely switch over to using rsync, in conjunction with SemiSophos's article's suggestions.
posted by philomathoholic at 6:32 PM on February 16, 2008


Response by poster: Well, unless someone else comes in and suggests anything else, I've pretty much decided how I'm going to set it up.

I'm going to implement daily backups (using rsync for speed). Using hard links, I'll store 7 daily backups, 4 weekly backups, and 6 monthly backups. That'll give me plenty of recovery points, and it should take up less space than I'm using now! I'll store 3 months of monthly backups on S3, at about $10/month, which is pretty reasonable.
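The hard-link trick behind that plan can be sketched with plain coreutils (directory names are made up); cp -al makes the new snapshot's files share storage with the old one:

```shell
# Make a "daily.0" snapshot, then clone it as hard links.
mkdir -p /tmp/hl/daily.0
echo "data" > /tmp/hl/daily.0/file.txt
cp -al /tmp/hl/daily.0 /tmp/hl/daily.1
# Both paths now refer to the same inode, so the clone costs almost
# nothing. A later rsync over daily.0 writes changed files to a temp
# name and renames them into place, which breaks the link and leaves
# daily.1's old versions intact.
```

rsync's --link-dest option automates the same idea in a single step.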
posted by philomathoholic at 9:31 PM on February 16, 2008


I have played around with backups using forests of hard links to make incremental backups look and behave like full backups, and it does work, but LVM snapshots work much, much faster since there's never any need to traverse the entire backup tree making hard links.
posted by flabdablet at 9:55 PM on February 16, 2008


I'd also recommend not bothering with tar. Compressed tarballs will save you a certain amount of space (though not much if most of your backed-up content is video and audio), and the compression itself will break rsync's deltas. Uncompressed tar archives are, in my opinion, pointless on backup disks.
posted by flabdablet at 9:59 PM on February 16, 2008


Response by poster: So, tell me a little more about LVM snapshots. From the little bit I read of your link, it looked a little too complicated and required a little too much stuff that I'd be uncomfortable with.

I'll be backing up from NTFS (using cygwin) and ext3 (on a mac), to ext3 (on debian), over the local network. Would everything be compatible? Do I have to setup the source disks in any special way? How about the destination volume?

Each full backup would only be about 16 GB. In my testing, traversing my entire backup tree and creating hard links took less than a minute. I'm not sure how much faster it could be, plus it'll be running on my dedicated backup server, so speed isn't a pressing concern.


With my new strategy, I'll be abandoning tarballs. I've added compression to my current scheme, but I'll remove it when I re-implement the whole thing.
posted by philomathoholic at 10:16 PM on February 16, 2008


For personal files (.bashrc, personal scripts, udev rulesets, etc) I just gzip 'em and mail them using bash and cron. For movies and whole discographies I use a socially-distributed backup: friends. I make sure that everyone knows what I have and pretty soon everyone has 'my backup copy' of whatever. I have theirs as well, and it's worked out wonderfully =)
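A sketch of that kind of config-file bundle (filenames are made up, and the mail step is left as a comment since the mail command varies by system):

```shell
# Collect a dotfile into a dated, gzipped tarball.
mkdir -p /tmp/cfg
echo "alias ll='ls -l'" > /tmp/cfg/.bashrc
tar czf /tmp/dotfiles-$(date +%Y%m%d).tar.gz -C /tmp/cfg .bashrc
# From cron, something along these lines would mail it weekly:
#   0 3 * * 0  tar czf - -C "$HOME" .bashrc | mail -s backup you@example.com
```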
posted by eclectist at 10:42 PM on February 16, 2008


RAID 1 and offsite backup storage.

Here's what I do, as a cheap way of covering my bases...

#1: I run RAID 1 (mirroring) at home, so that if I lose a drive I'll still have the other one;
#2: I periodically back up using rsync to a portable drive that I bring home on, say, Wednesday night and bring back to work Thursday morning, so that it's only in the same physical location as my home computer when I'm physically there as well.

This way, if there's a local drive failure I don't even have to recover from a backup, and if something goes horribly wrong (either epic data loss or, say, a fire) I have an offsite copy no older than the period of time passing between home visits of the portable drive (around a week or so.)
posted by davejay at 11:06 PM on February 16, 2008


Best answer: Short answer: only the destination volume would need special treatment if you were going to use LVM snapshots.

Once you'd created your backup filesystem on an LVM logical volume, you'd just rsync stuff into it like any other filesystem. Then, after each rsync session was finished, you'd allocate say 5% of your available disk space as an LVM snapshot of your backup volume. That would give you a complete mirror of your backup filesystem, unmounted and offline, frozen at the moment you made the snapshot.
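The snapshot step itself is a single command. A sketch (volume names and the size are made up, and this needs root and a real LVM setup, so it's shown for illustration only):

```
# Freeze the current state of /dev/vg0/backup in a snapshot that can
# absorb up to 5G of subsequent changes before LVM discards it.
lvcreate --snapshot --size 5G --name backup-20080217 /dev/vg0/backup

# To restore later, mount the snapshot read-only and copy files out.
mount -o ro /dev/vg0/backup-20080217 /mnt
```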

Subsequent rsync sessions could go straight over the top of the existing one. The changes would affect only your original backup volume, but would not alter any of its snapshots. If you wanted to restore files from a given snapshot, you'd just need to mount it like you would any other volume (on /mnt, say).

You can delete old snapshots by hand (or by script) any time you want, but even easier is just to let LVM do that itself: old snapshots will disappear, returning their storage to the available pool, when they no longer have space available to represent the differences between themselves and the original volume they're snapshots of.

If you've currently got disk space for four complete backup sets at 16GB each, you've probably got room for at least 30 usefully long-lived snapshots of your whole backup set provided you're only bringing that set up to date by writing changes to it each session (e.g. with rsync).

Long answer in the form of a quickie LVM primer:

LVM is a way of managing disk volumes that gives you much more flexibility and maintainability than you get from standard partitioning alone. The basic idea is that instead of putting your filesystem (ext3, ReiserFS or whatever) straight onto a disk partition, you put it on a logical volume instead.

Logical volumes, in turn, are created from pools of storage called volume groups. Carving a volume group up into logical volumes is analogous to carving a complete disk device into partitions, except that positioning doesn't matter. You can extend or shrink logical volumes at will, regardless of what you're doing with any other logical volumes in the same volume group, provided the volume group has enough unallocated space.

You can also create as many snapshot volumes within a volume group as you want, each snapshot being attached to any standard logical volume within the group. Snapshot volumes can occupy much less space than standard volumes, because all they have to store is the differences between the current contents of the volume they're based on and its contents at the time the snapshot was made. Even so, they appear to you as if they were the full size of the standard volume. All the underlying differencing is completely transparent - as far as the file system is concerned, a snapshot volume looks the same as any other kind.

When there are more differences between a snapshot and its origin volume than will fit in the space allocated for the snapshot, the snapshot gets automatically discarded, and the space it occupied gets returned to the volume group.

Volume groups are themselves constructed from aggregations of physical disk devices or partitions. You can add extra physical space to a volume group at any time, just by adding another physical device to the group. You can also remove a physical device from a volume group at any time without data loss, provided that doing so still leaves the volume group big enough to hold all its existing logical volumes. You can even do RAID-0-like striping tricks if you want.

Concrete example:

My present toy server box has four disk drives in it: a 200GB Seagate drive, a 320GB Seagate drive, and two 400GB Samsungs. Each of those has three partitions: 80MB for boot, then 1GB for swap, and the third, occupying the rest of the drive, for LVM.

Those four LVM partitions are combined into a single volume group that I've chosen to name vg0. Within that volume group, I currently have five logical volumes: /dev/vg0/dapper, /dev/vg0/gutsy, /dev/vg0/lv2 and /dev/vg0/lv3 are 10GB root volumes for assorted Linux distros to live in, and /dev/vg0/home is a 600GB volume for /home.

/dev/vg0/dapper, /dev/vg0/gutsy and /dev/vg0/home have ReiserFS file systems on them. /dev/vg0/lv2 and /dev/vg0/lv3 are currently unused.

I've just checked my disk usage with df, and I've found that /home is currently 94% full. I'd like an extra 100GB on it just to keep things comfy. So I do

sudo lvextend --size +100G /dev/vg0/home
sudo resize_reiserfs /dev/vg0/home


My /home volume is now 700GB instead of 600GB, and is only 84% full. I didn't even have to unmount /home to do this safely (thank you, resize_reiserfs!)

When it comes time to replace my smallest drive, I will dangle the new drive off a spare cable, partition it with the same three-partition scheme as the other four, format the third partition for LVM and incorporate it into vg0 with

sudo pvcreate /dev/sdx3
sudo vgextend vg0 /dev/sdx3


then migrate all the existing data off the small drive using

sudo pvmove /dev/hdg3

then remove the small drive from the volume group entirely using

sudo vgreduce vg0 /dev/hdg3

leaving it ready to be unplugged from the server box and replaced with the larger drive.

LVM is also smart enough not to care what device names its physical volumes end up with, so I can cable my drives up any way that's convenient after doing the above.

I don't have to halt my server, or even unmount any logical volume based filesystems while doing any of this except the actual hard disk installation and cabling. Eventually I will have an all-SATA box and be able to do the whole thing live.

When distro upgrade time rolls around, I can make a snapshot of the logical volume my existing root filesystem is on (making the snapshot the full size of the existing root volume so it will never automatically disappear), alter that filesystem's label and/or UUID, and run a distribution upgrade. If I need to roll back to the pre-update state, I can do that simply by rebooting and picking the old root volume's label. Creating a snapshot is much faster than copying a volume.

LVM is indeed a beautiful thing, and richly repays the time spent getting familiar with it.
posted by flabdablet at 2:21 AM on February 17, 2008 [2 favorites]


Uncompressed tar archives are, in my opinion, pointless on backup disks.

In general, your life will be a lot simpler without tar. But here's one exception: I once found some files were not being backed up because they were nested so deeply that their pathnames exceeded the limit on the target filesystem, and wrapping them in a tar archive was the way around it. So before you drop tar, make sure that isn't the case for you.
posted by grouse at 2:50 AM on February 17, 2008


That's been a fairly regular occurrence for me too, when backing up a school Windows 2003 server to DVDs. Quite often, kids (and some staff who should know better) will fail to specify a filename when they save a Word document, resulting in Word packing as much of the first paragraph into the filename as pathname length limits allow. These files often blow up simple-minded automated backup scripts.

That's actually another nifty thing about using LVM snapshots to generate your restoration points: you don't need to insert any kind of timestamp prefix into the backup tree's pathnames. If you arrange for the backup filesystem's mount point to have a nice short directory name, you're unlikely to be bitten by this issue.
posted by flabdablet at 3:05 AM on February 17, 2008


FYI- LVM snapshots themselves are not backups. If a drive craps out, it's all gone.
posted by gjc at 6:19 AM on February 17, 2008


Response by poster: flabdablet: Wow, thanks for the full explanation. Due to the difficulty of reformatting my NAS, I'll hold off for now, but I'll be sure to give it serious consideration when replacing my current drive.

gjc: flabdablet indicates that I should rsync to the LV from the backup sources, and then store all the previous backups as snapshots. It's similar to what SemiSophos' article does with hard links, it's just a more space efficient way of storing multiple slightly different copies of the same data. I'd still keep some backups off-site.
posted by philomathoholic at 9:26 AM on February 17, 2008


By the time your current drive is due for replacement, new drives are almost sure to be much bigger. So, set up LVM on the new one before decommissioning the old one, then dd entire partitions off the old drive onto new LVM logical volumes. That's how I did mine, and it worked fine.

You might want to mess around with tune2fs to change volume labels and/or UUIDs on the cloned volumes, just to make sure your boot loader and/or fstab can reliably distinguish them.
posted by flabdablet at 2:51 PM on February 17, 2008


Actually, you might want to cp -av all the files instead of dd'ing the partition they're on, since that would let you format the logical volume with reiserfs instead of ext3. I really like reiserfs for this job, mainly because the resize_reiserfs tool lets you grow a reiserfs file system to fit an expanded logical volume without unmounting it first.
posted by flabdablet at 2:53 PM on February 17, 2008


Have a look at rdiff-backup http://www.nongnu.org/rdiff-backup/

I suspect it can very easily do what you want. It's like using rsync, except it keeps versioned sets of your files: you can do backups hourly, daily, weekly, and so on, restoring from any point, without eating up disk space linearly with the number of backups.
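Typical usage looks like this (paths are made up, and it needs rdiff-backup installed, so it's shown for illustration only):

```
# Each run stores a full current mirror plus reverse diffs.
rdiff-backup /home/user /backups/home

# Restore the whole tree as it was ten days ago:
rdiff-backup --restore-as-of 10D /backups/home /tmp/restored
```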
posted by TravellingDen at 4:22 PM on February 17, 2008

