How do I back up a large amount of data to DVDs?
July 1, 2008 11:50 PM

Can you suggest improvements and help me work out the details of my plan to back up several hundred GB using the Mac?

I just dug out a box of CD-Rs from 1998, the first CDs I burned after I got a burner. On them are files going back to when I started using computers, and out of 20 or so discs, only a few files on one disc were corrupted. I wouldn't have been heartbroken if any of those files had gone missing, but I would be if a lot of the files I've created since were to disappear, especially now that photography has gone digital. Additionally, I've recently had several thousand negatives of old family photos scanned.

My data is fairly well backed up across multiple hard drives, but I still feel somewhat vulnerable without any backup that isn't a hard drive. I have several hundred GB of data at this point for which I would like a backup I can toss in a box for 10 years like my CDs from 1998 -- something offline and permanent, with a realistic expected lifetime of at least several decades.

My plan:

1. Create a DMG as large as I anticipate the total amount of data to be.

2. Copy everything I'd like to back up to this DMG.

3. Using some standard terminal commands built into OS X which I forget (any help here?), split this huge DMG file into parts sized to fit onto DVDs (I haven't decided between single and double layer yet).

4. Using MacPar, create parity files the same size as the regular part files. The advantage of doing this is that I can say, "as long as less than 10% (or whatever percentage I decide) of the discs are corrupted when I want to restore my files, I can restore all of my data bit-for-bit." The disadvantage is that if too many discs are corrupted, I lose EVERYTHING, as opposed to losing only the damaged files if I had just put files on discs without enclosing them in an image.

I feel that the tradeoff described in #4 is worth it, considering that it is a royal pain in the ass to manually split up a ton of files onto separate discs.

Another thing I have thought about is that this scheme relies entirely on formats which are pretty standard, maybe even open source. PAR, DMG, and the Unix-based split/join commands (hopefully someone will remind me what these are) will be accessible for a very, very long time. The weird proprietary format of some random backup program, not so much.
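(If I'm remembering right, the built-in tools are plain old split and cat, something like the lines below, but I'd appreciate someone confirming; the 4300m is just my guess at what safely fits on a single-layer disc.)

    # chop the image into 4300 MB pieces: backup.dmg.aa, backup.dmg.ab, ...
    split -b 4300m backup.dmg backup.dmg.

    # later, reassemble the original image from the pieces
    cat backup.dmg.* > backup-restored.dmg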

I anticipate that it will take a few hours to create the split image files, perhaps overnight or more to create the PARs (though that time is unattended), and then I can burn the discs over a few days or weeks. Not too bad.

Can anyone comment on this, or suggest an easier/better way to create an optical backup of 200 GB of files or so?
posted by david06 to Computers & Internet (15 answers total)
 
To avoid the limitation in step #4, use ISOs instead of DMGs. Burn your files normally, directly to the disc, and then create ISOs from the burned discs. Make PARs from the ISOs. You gain the ability to access individual files, but you still have redundancy. Also, I recommend finding something that makes PAR2 files, which are a bit more flexible than PARs in that they work at the block level rather than the file level.
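Roughly, assuming you have a command-line par2 tool installed (par2cmdline, via MacPorts or similar) and the burned disc mounts as /Volumes/BACKUP_01 (the names are just examples):

    # rebuild an ISO image from the mounted, already-burned disc (hdiutil ships with OS X)
    hdiutil makehybrid -iso -joliet -o backup_01.iso /Volumes/BACKUP_01

    # create PAR2 recovery files with roughly 10% redundancy for that image
    par2 create -r10 backup_01.par2 backup_01.iso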
posted by zsazsa at 12:10 AM on July 2, 2008


Oh, I'm sorry. I wasn't thinking; that means you'll have to deal with splitting up the files manually. Never mind me.
posted by zsazsa at 12:11 AM on July 2, 2008


I'm not a huge fan of steps 1-3.

In fact I'm really not sure why you'd want to pack everything into a DMG and then segment the DMG before burning. That's an extra layer, and it means if you lose one (well, a couple, depending on how you do the PARing) of the discs, you're hosed. All your data is potentially gone. This defeats all of the benefit of putting your data onto a bunch of relatively small media containers (like CDs/DVDs). Plus, DMG files are a proprietary format; I don't know if Apple even publishes a specification. If you really were hell-bent on archiving everything in one file first, you'd be much, much better off using GNU tar.

So anyway, what I'd do is just skip the DMG-creation part. Instead, make a whole bunch of directories, labeled Disc1, Disc2 ... DiscN. (You can create them as-needed.) Turn on "Show all sizes" for the window enclosing them. Then just drag stuff into them and try to get each somewhere around the 4GB mark.

Admittedly, this makes it tougher to add PAR files to the discs (you'd have to PAR the individual files, rather than the whole thing), but IMO you more than make up for this by eliminating the huge fault risk of a disk image into which all the data goes. A scratched disc might cost you a few files, but it probably won't ruin the disc, and it definitely won't ruin your entire archive.

[As a sort of sidenote: I've always thought it would be neat to write a shell script that would take a folder full of data, tar it up, and then calculate PAR files for it, using a redundancy value for the PAR calculation necessary to bring the total archive+PAR space requirements up to 4.3GiB, so you could burn the whole mess to a disc. Having to do this manually kinda sucks.]
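Something like this is what I have in mind, as a very rough, untested sketch; it assumes a command-line par2 tool is installed and that the folder is no bigger than the disc:

    #!/bin/sh
    # tar up one folder, then fill the rest of a DVD's worth of space with PAR2 recovery data
    FOLDER="$1"
    DISC_MB=4300                     # rough usable space on a single-layer DVD, in MB

    tar -cf "$FOLDER.tar" "$FOLDER"

    TAR_MB=$(du -m "$FOLDER.tar" | cut -f1)
    SPARE_MB=$(( DISC_MB - TAR_MB ))

    # express the leftover space as a redundancy percentage for par2
    REDUNDANCY=$(( SPARE_MB * 100 / TAR_MB ))
    # par2's -r tops out at 100%, so cap it
    [ "$REDUNDANCY" -gt 100 ] && REDUNDANCY=100

    par2 create -r"$REDUNDANCY" "$FOLDER.par2" "$FOLDER.tar"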

Anyway, the simpler you can make your storage format, the easier it'll be to recover stuff later, and the more likely you'll be able to get anything off at all. A giant disk image segmented across dozens of discs would be a major discouragement if I ever wanted to retrieve a file; if you just dump the files onto discs, you avoid that.

One other thing I'd do is index each disc once you burn it. There are some nice shareware utilities out there for doing just this, or you can dump a file list using the "find" command from the Terminal if you're cheap. Either way, it'll be nice to be able to easily locate which disc a particular file lives on later.
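If you go the cheap route, one line per disc is about all it takes (adjust the volume name to whatever the disc mounts as):

    # dump a recursive file listing of the burned disc into an index file
    find /Volumes/Disc1 > Disc1-index.txt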
posted by Kadin2048 at 12:18 AM on July 2, 2008


I have nothing to contribute to this, but I was planning to ask a similar question, just a less advanced version of it.

I was hoping to simply find some app or script that would neatly divide all the files I designate into folders of less than 4 GB or so each.
What are the advantages of DMGs and ISOs?

Or is there simply no automated way to divide the hundreds of gigs into equal bins?
posted by prophetsearcher at 12:47 AM on July 2, 2008


Erm. This is why people still buy Toast. It automagically does all this sorting/splitting/etc. for you.

Several hundred GB of files is no fun to burn 8GB at a time. Consider picking up a Blu-Ray burner.
posted by mmdei at 1:19 AM on July 2, 2008


Response by poster: Toast doesn't do this in a way nearly as safe for long-term storage as even the flawed scheme in my original post.

Everything above has been really helpful. The biggest thing I've gotten from it is that if I keep my "one large file" idea, I should copy everything to a folder and tar the folder rather than copy everything to a DMG, since tar is a far more universal format.

I'm still kind of torn on the whole individual files vs. one large image thing. The two things I'd like to add on that point are:

- I'd be far, far more likely to do things this way if an automated tool existed for all the splitting that would still let me use PAR or some other parity system to protect the data. If such a tool existed (or if I knew of its existence), I would probably never have thought about PAR files and the rest; I'd have just burned two copies of everything and hoped for the best.

- It's not a big deal if it is inconvenient to recover the data. I have everything on multiple hard drives. One of those "hard drives" is actually a fault-tolerant RAID. This is not my first backup of anything, it's an "oh shit multiple backups have failed" solution.
posted by david06 at 1:31 AM on July 2, 2008


You say you want long-term storage, so use TAR over any other scheme, if only because of its Unix roots. Beyond that, I'd avoid any compression, to keep the files as "normal" as possible. Even then there is no guarantee that you'll be able to read the files, because of the unplanned obsolescence of file types. (Try opening a MacPaint or MacWrite file without the original software, and those programs are only about 24 years old.)
posted by Gungho at 4:22 AM on July 2, 2008


I would err on the side of simplicity, myself.
I would ditch all the DMG/PAR/TAR/ISO steps and simply save the files normally. This will actually minimize the effects of any data loss. It's better to lose a few normal files and still have the vast bulk than to lose everything because of a corrupt PAR (or whatever).

Your main concern should be what your storage medium of choice will be. DVD seems to be the de facto choice. The unfortunate reality of digital storage is that none of it is especially archival. CDs and DVDs do become unreadable over time. Unfortunately, they are really the only viable format available to the consumer. But 25 years from now, there's no guarantee that any computer you have will even be able to read a DVD. I mean, when was the last time you saw a computer for sale that featured a Zip drive? Bernoulli discs, anyone?

I guess what I'm saying is this: simply save the files to DVDs, but be prepared to re-save them onto a new storage medium within the next 5-10 years, and probably continue to do so over time, so the files stay accessible with whatever contemporary technology exists in the future.
posted by Thorzdad at 4:56 AM on July 2, 2008


Go out and get yourself a 1 terabyte external USB drive. I think they're like $150-200.
More expensive than burning DVDs, but definitely a lot more practical.
Then store the hard drive securely...
posted by PowerCat at 5:47 AM on July 2, 2008


Mark Pilgrim, Long-Term Backup. And the followup. The takeaway:
Long-term data preservation is like long-term backup: a series of short-term formats, punctuated by a series of migrations.
posted by holgate at 5:57 AM on July 2, 2008


Your scheme is a lot of hassle, and DVD blanks are cheap.

Why not just burn two of every disc?

(This is similar to the "RAID 5 sucks, just use RAID 1" argument)
posted by OldMansHands at 6:18 AM on July 2, 2008


I'd suggest dar rather than tar for this kind of job. The extra features (including built-in file splitting) make it worth it, and it works just fine on a Mac. The only thing it won't do for you is the redundancy stuff in case you lose a disc.
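Something along these lines, if I'm remembering the flags right (-s sets the slice size, -R the directory to back up; the names are just examples):

    # back up ~/Pictures into 4300 MB slices: photos.1.dar, photos.2.dar, ...
    dar -c photos -s 4300M -R ~/Pictures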
posted by edd at 7:10 AM on July 2, 2008


Files are like houses: if they're used they'll last forever. It's when you forget about them and ignore them that they degrade. Anything involving discs is going to be a poor solution because it will be a huge undertaking to update/recreate, and then you're back to relying on this one instance of backing up.

Two suggestions:
1) Buy maybe half a dozen hard drives. Generate MD5 sums of all your files and store the sums along with the files identically on all the hard drives (a rough sketch of the commands is below). Label them clearly 1-N. Now rotate them regularly (monthly? bimonthly?) (rsync 1 to 2, put 1 in a box and put 2 in your machine), send one to your parents' house in Nova Scotia, etc.

2) S3. Let someone else (specifically Amazon) handle the bullshit for you. You'll pay for bandwidth and storage, and it'll likely take a week or two to upload (unless you've got someone with gigE who owes you a favor). Take some time and let whatever tool you use to upload populate the Content-MD5 field. Assuming 200 gigs, it'll cost you $20 to upload and $30/month to store it, calculator here. You're safe unless your house burns down at the same time as Amazon crashes, in which case I suspect the nuclear war that caused it all will be your more pressing concern. Plus, your stuff is now at the top of a very, very steep bandwidth pipe and you can easily make those files available to anyone.
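For suggestion 1, the checksum-and-rotate part is only a couple of commands; the drive and folder names here are just placeholders:

    # record a checksum for every file under the data folder, stored on the drive alongside it
    cd /Volumes/Backup1
    find Photos -type f -exec md5 -r {} \; > checksums.md5

    # on rotation day, mirror drive 1 (checksums and all) onto drive 2
    rsync -av --delete /Volumes/Backup1/ /Volumes/Backup2/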

This is, roughly, what I do with a smaller amount of digital photos (~80 gigs).
posted by Skorgu at 11:06 AM on July 2, 2008


I've been thinking about this since last night. I haven't really had an epiphany or anything, but it seems odd that I can't find anyone else's solutions to this issue.

Taking a whole bunch of files, putting them in a tar archive, breaking that archive into fixed-size chunks (either the size of the DVD, or the size of the DVD less space for PAR files), and then burning the chunks ... that's not hard at all. dar does it with "slices" and tar does it with the multivolume and --tape-length options.
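With GNU tar (gnutar on the Mac; I don't think the stock bsdtar does multi-volume), it looks roughly like this. The length is in 1024-byte units, so 4300000 is about 4.4 GB:

    # write DVD-sized volumes, moving on to the next --file as each one fills up
    tar --create --multi-volume --tape-length=4300000 \
        --file=vol1.tar --file=vol2.tar --file=vol3.tar \
        ~/stuff-to-back-up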

I really just don't like the idea of archiving-and-splitting at all; it just screams "bad idea!" to me, because it creates a failure mode where losing one disc could potentially mean losing the whole archive. I just can't get myself on board with that idea.

Really what you want is a shellscript that takes files you want to back up and 'packs' them into directories, attempting to optimize each directory so that it's as close to a predefined target (say, 4GB) as possible. Then you could tar and PAR each directory independently, and you wouldn't have the nasty disc-spanning issue—each archive would be totally independent. (Or you could use 'dar', as edd suggests, which can do PARing very easily using a single command.) That would accomplish pretty much everything you want to do, I think. Unfortunately I'm not sure off the top of my head how best to do it.
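A crude, untested greedy pass might look something like the script below. It only packs whole top-level items, so anything bigger than the target just gets skipped with a warning, and it copies rather than moves (swap in mv if you don't want a second copy on disk):

    #!/bin/bash
    # pack top-level items from a source directory into Disc1, Disc2, ... of at most MAX_MB each
    SRC="$1"
    MAX_MB=4000
    disc=1
    used=0
    mkdir -p "Disc$disc"

    # biggest items first tends to pack a little tighter
    du -sm "$SRC"/* | sort -rn | while read size item; do
        if [ "$size" -gt "$MAX_MB" ]; then
            echo "skipping $item ($size MB): too big for one disc" >&2
            continue
        fi
        if [ $(( used + size )) -gt "$MAX_MB" ]; then
            disc=$(( disc + 1 ))
            used=0
            mkdir -p "Disc$disc"
        fi
        cp -R "$item" "Disc$disc/"
        used=$(( used + size ))
    done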

It seems like this is a Perl script that might do it, but Perl is Greek to me. There's also a Windows utility for this exact purpose. Unfortunately, I can't find anything for Mac/Linux. (It's possible there's some trivial way to do this that I'm missing...?)

I guess if you're really unconcerned with the failure mode in which loss of a single disc might compromise the integrity of the entire archive, tar (or dar) away, add your PAR (which can be automatic if you use dar), and burn. The website for dar lists a bunch of sample scripts, some of which might be overkill but provide good examples.

Anyway, you've piqued my interest, and I'll keep thinking about it, maybe ask around to some more knowledgeable folks and see if some really elegant method that I've missed pops up.

It's worth noting that the author of DAR is not unaware of the ugliness of disc-spanning archives. Unfortunately it's not going to get solved any time soon, and has to be that way for some valid implementation reasons, leading me to further believe that the Right Way is to break the files to be backed up into appropriately-sized chunks before archiving and parity is added, not after.
posted by Kadin2048 at 8:52 PM on July 2, 2008


Fun fact: tar files (almost) always work if you manually split them. Meaning, you can chop one up & start restoring stuff from the middle without going back to the start.

Huh. I haven't actually written about this before, so this description might be a bit choppy: I made an alphabetical list of files, kicked off a tar from that list, and fed it into a perl script that would split it into 700 meg files, pausing in between. Take six of those CD-sized files, put them in a directory, & create some checksums for that directory - md5 (error detection) & par2 (error correction).
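I can't paste the perl script here, but plain split can stand in for it if you don't need the pausing; very roughly, with made-up names:

    # tar everything on the list, chopping the stream into 700 MB chunks
    tar -cf - -T filelist.txt | split -b 700m - backup.tar.

    # after moving six chunks into a dvd01 directory, checksum that directory
    cd dvd01
    md5 -r backup.tar.* > checksums.md5
    par2 create recovery.par2 backup.tar.*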

After burning a DVD & verifying the checksums, you could remove the data but not the checksums - instead, later, treat the checksums as the data & make stage 2 checksums & DVDs for them.

Separately, the same perl script could read the chopped tar files & feed them into a tar test, letting you create a list of which files went into which tar files. Copying these catalogues onto the DVDs is also a good idea.

Caveats: my par2 doesn't like working with thousands of files. If you have CDs with lots of files on them, I'd back up an ISO rip of each disc instead; those are mountable directly. And I'd strongly recommend par2 over par: if a DVD's table of contents goes bad, keeping you from even getting a directory listing, you can still recover from whatever you manage to rip of the rest of the disc image.
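For the ISO-rip case, something like this; the device name will vary (check diskutil list), and you need a command-line par2 installed:

    # unmount the disc's filesystem but keep the device, then rip it to an ISO
    diskutil unmountDisk /dev/disk2
    dd if=/dev/disk2 of=olddisc.iso bs=2048

    # make recovery data now; verify (and repair, if needed) whenever you check on it later
    par2 create -r10 olddisc.par2 olddisc.iso
    par2 verify olddisc.par2
    par2 repair olddisc.par2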

Is this useful?
posted by Pronoiac at 2:21 PM on July 13, 2008


This thread is closed to new comments.