Block device timestamps for backup support
October 4, 2010 8:48 PM

In the shower this morning, I had an idea about backup support for disk drives. Since all my best inventions either have fatal fundamental flaws or have already been invented years ago by other people, has anybody already done this one for Linux?

Briefly, what I have in mind is something that looks just like a standard block device, except that it also timestamps every block it writes with the current system time. Timestamps would want to be accurate, but not super-precise; a standard 32-bit Unix time would be sufficient. Assuming a 4K blocksize, then, 1024 timestamps would fit in a block.

A device driver or filter layer would reserve (total blocks / 1024) blocks on an underlying block device for keeping the timestamps in, making the remainder available as a slightly shrunken but otherwise completely standard block device.
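
In rough, purely illustrative Python (every name below is made up, and this is layout arithmetic only, not driver code), the bookkeeping is about this simple:

    BLOCK_SIZE = 4096                      # bytes per block
    TS_SIZE = 4                            # one 32-bit Unix timestamp per block
    TS_PER_BLOCK = BLOCK_SIZE // TS_SIZE   # 1024 timestamps fit in one block

    def layout(total_blocks):
        """Split the underlying device into exported data blocks and reserved
        timestamp blocks, with the timestamp area at the end."""
        reserved = -(-total_blocks // (TS_PER_BLOCK + 1))   # ceil(total / 1025)
        return total_blocks - reserved, reserved

    def timestamp_location(block_no, data_blocks):
        """Reserved block and byte offset holding data block block_no's stamp."""
        return (data_blocks + block_no // TS_PER_BLOCK,
                (block_no % TS_PER_BLOCK) * TS_SIZE)

    data, reserved = layout(366_210_937)   # roughly a 1.5TB drive in 4K blocks
    print(data, reserved)                  # overhead is about 0.1% of capacity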

Both the live device and any corresponding backup devices would work this way. There would also be a device UUID to associate the live device with its backups.

During a backup or restore operation, timestamp data on source and destination devices would be compared, and the only blocks copied from source to destination would be those containing data written no earlier than T seconds before the corresponding destination blocks. Destination blocks would be timestamped with their actual time of writing rather than a copy of the source block timestamp. T would be small enough to keep superfluous copying to a practical minimum, but large enough to prevent missed backups due to concurrent writes and/or clock drift and granularity.
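
A backup pass, then, is just a comparison of two timestamp tables. Here's a minimal sketch, with read_timestamps() and copy_block() standing in for whatever interface the driver would actually expose:

    T = 30   # seconds of slack for clock granularity, drift and concurrent writes

    def backup_pass(src, dst, read_timestamps, copy_block):
        """Copy every source block written no earlier than T seconds before the
        corresponding destination block."""
        src_ts = read_timestamps(src)     # one 32-bit timestamp per block
        dst_ts = read_timestamps(dst)
        copied = 0
        for block, (s, d) in enumerate(zip(src_ts, dst_ts)):
            if s >= d - T:                      # source data may be newer
                copy_block(src, dst, block)     # driver stamps dst with the
                copied += 1                     # actual time of writing
        return copied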

This would allow a full disaster-recovery mirror backup to be updated on media normally stored offline and offsite in as little time as a traditional incremental file backup, regardless of the partitioning arrangements and/or filesystem(s) in use on the live device. Disaster recovery could be as simple as physically plugging a backup drive into the bay the failed live drive used to occupy. This would be particularly valuable for servers hosting virtual machines.

So what's this idea's fatal flaw? And if it doesn't have one, who has already done this for Linux, and what's it called?
posted by flabdablet to Computers & Internet (34 answers total)
 
Maybe I don't fully understand your plan, but journaling sounds familiar. I think ext4 is a journaling file system.
posted by Blazecock Pileon at 9:05 PM on October 4, 2010


Right, so is HFS+ (the Mac OS X file system) and NTFS.

But journaling is less for traditional "backup and recovery" purposes than to make sure that file operations can complete successfully even if there's a power failure.

As an aside, I remember a few years ago when I was spec'ing out a backup system for a small office I found out that EMC and IBM owned an enormous number of B&R-related patents, and some features (searching for files across sets, I believe, was one) were only available in their products.
posted by Oktober at 9:12 PM on October 4, 2010


Response by poster: As I understood it, journaling typically happens at the filesystem layer, and journals only persist for as long as they need to in order to make filesystem updates appear atomic.

What I'm talking about is a facility that would allow me to back up an entire block device to another, take the backup drive away, reinstall it a week later and run another backup onto it; that second backup would go relatively quickly because the only data copied would be what's changed that week, but I'd still end up with a complete mirror of my live drive.
posted by flabdablet at 9:17 PM on October 4, 2010


One potential issue I can see off the bat (which may become a non-issue as SSDs become more prevalent) is that every write to the device now implicitly involves writing to two (or more, depending on the number of blocks written and the level of fragmentation) non-adjacent sectors of the underlying physical device, which could lead to some pretty poor write times, as the amount of seeking is greatly increased.
posted by james.nvc at 9:24 PM on October 4, 2010


The general process is similar to what most synchronization/backup software does, albeit at a higher, more abstract level. I'm not aware of anything that does what you're suggesting, but that certainly doesn't mean it doesn't exist. Doing it below the file system, as you propose, might demand greater processing overhead, and it would have to meet certain performance criteria to succeed as a consumer product.
posted by Blazecock Pileon at 9:27 PM on October 4, 2010


So you're going to keep around all 1.5 TB of my disk when I'm only using 50GB of it? And when I run something to erase data from all of the unused blocks you're going to re-backup that 1.45TB of data just because those blocks were changed?

When I edit a 400MB video and add meta information at the beginning causing the last 399MB of the file to be shifted by 20 bytes you're going to have to backup all of those blocks again instead of just backing up a delta of 20 bytes?
posted by zengargoyle at 9:32 PM on October 4, 2010


I don't know exactly how it works, but you should probably see if you can find any data on the NetApp WAFL block-level snapshot procedures.
posted by benzenedream at 9:53 PM on October 4, 2010


Time Machine on OS X already does something similar to this, but at the FS level instead of the device level. From Wikipedia:

Time Machine creates a folder on the external drive whose name contains the current date and time. It then copies the entire primary hard drive[1] (except for files and directories that it has specifically been told not to copy) to the folder. Every hour thereafter, it creates a new folder on the remote drive using the same naming scheme, but instead of making another complete copy of the primary hard drive, Time Machine instead only backs up files that have changed and creates hard links to files that already exist on the remote drive. A user can browse these "versions" of the primary drive and see each file as if it were right where it was left.

One clear benefit over doing it like this in the FS layer instead of the device layer is that browsing and inspecting the backup is as easy as browsing the filesystem with your normal tools. If you did it at the device layer then you'd have to write some special interface and drivers to access your old backups.
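
The hard-link trick itself is simple enough to sketch in a few lines of Python (a toy version only; Time Machine and rsync's --link-dest handle far more edge cases):

    import os, shutil

    def snapshot(source, prev_snap, new_snap):
        """Copy changed files, hard-link unchanged ones to the previous snapshot,
        so every snapshot looks like a complete copy of the source tree."""
        for root, dirs, files in os.walk(source):
            rel = os.path.relpath(root, source)
            os.makedirs(os.path.join(new_snap, rel), exist_ok=True)
            for name in files:
                src = os.path.join(root, name)
                old = os.path.join(prev_snap, rel, name)
                new = os.path.join(new_snap, rel, name)
                st = os.stat(src)
                if (os.path.exists(old)
                        and os.stat(old).st_mtime == st.st_mtime
                        and os.stat(old).st_size == st.st_size):
                    os.link(old, new)        # unchanged: share the old copy
                else:
                    shutil.copy2(src, new)   # changed or new: copy the data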
posted by sbutler at 10:01 PM on October 4, 2010


Also, look up "snapshot", as this may be similar technology. (eg. you create a "logical drive" which is a "snapshot" of another drive at a point in time, yet it doesn't take up that much space because it works in a manner similar to what you describe.)

For example, circa 2002:
http://tldp.org/HOWTO/LVM-HOWTO/snapshotintro.html
posted by smcameron at 10:03 PM on October 4, 2010


Also, if you wrote a Time Machine clone for Linux I think the community would be very appreciative. OS X added secret hard link support to directories to make it work efficiently... not sure if you can hard link directories in Linux.
posted by sbutler at 10:06 PM on October 4, 2010


Seconding james.nvc: there is no need to time-stamp the block; time-stamp at the file level is fine. Besides the seek time issue mentioned, generally you do not want to write software assuming such specificity of hardware. Abstraction exists to allow independent development in each layer without causing incompatibility. The file system is an abstraction created by the OS to shield application programmers from the nitty-gritty detail of hardware implementation. You are not a hardware engineer, nor an OS programmer, so you know nothing of the trade-off each has to make to ensure that the hardware performs optimally. What if it's an SSD, or a RAID array, or a tape drive? You don't want to rewrite your app for each type of hardware, do you?
posted by curiousZ at 10:14 PM on October 4, 2010


Read up on ZFS and its many cool capabilities, such as copy-on-write, which lets you make snapshots much like you describe.
posted by zsazsa at 10:56 PM on October 4, 2010 [1 favorite]


Response by poster: Blazecock: NTFS does in fact include the ability to create a persistent journal. It doesn't do the same job as what I'm proposing, because its job is to record changes rather than instrument the current state.

If I were to attempt to base an offline backup strategy on that journal, I could blow it up by doing millions of small writes to one file. This would fill up the journal and make it start dropping old change records, at which point it's no longer useful for computing the difference between what's on my live drive and what's on the backup.

It also only works for NTFS filesystems. What I'm proposing is simple enough to be implemented at the block device level, which means you could use any filesystem you liked on top of it.

james.nvc: yes, there would be a performance hit, but it should be reasonably small since timestamps are so much smaller than the data blocks they describe. One strategy worth investigating would be to scatter timestamp blocks throughout the device instead of bunching them all up at the end, so that every 1025th block contains the timestamps for the preceding 1024. The filter driver could adjust the block numbers appropriately or, if you wanted to make the device compatibly accessible via a non-timestamping driver, could simply return a read error for any attempt to read a timestamp block and rely on the filesystem to skip them (every filesystem available, as far as I know, includes some mechanism for handling unreadable disk blocks).
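
To make that interleaved layout concrete, the remapping is just arithmetic (illustrative Python, not driver code):

    GROUP = 1024      # data blocks per timestamp block
    TS_SIZE = 4       # bytes per 32-bit timestamp

    def to_physical(logical_block):
        """Map an exported block number onto the underlying device, skipping
        one timestamp block after every 1024 data blocks."""
        group, index = divmod(logical_block, GROUP)
        return group * (GROUP + 1) + index

    def timestamp_slot(logical_block):
        """Underlying block number and byte offset of this block's timestamp."""
        group, index = divmod(logical_block, GROUP)
        return group * (GROUP + 1) + GROUP, index * TS_SIZE

    assert to_physical(0) == 0
    assert to_physical(1024) == 1025          # skips physical block 1024
    assert timestamp_slot(1023) == (1024, 4092)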

zengargoyle: yes.

The use case I have in mind is rapid disaster recovery backup for small-business servers that host a collection of VMs. The intent is to make backing up such a machine achievable in a reasonable amount of time onto the next of a rotating set of usually-held-offsite hard disks, any one of which could simply be plugged into the machine's replacement and booted up after recovery from fire, theft or flood.

Disasters are infrequent, hard disks are hella cheap, and using them this way could be really cost-effective for a heap of small businesses for whom online redundant servers are overkill.

So you're going to keep around all 1.5 TB of my disk when I'm only using 50GB of it?

If you've built a computer with a 1.5TB drive in it, you presumably have a use for that much space, and you'd presumably want that much space available again after disaster recovery. So you'd buy a handful of 1.5TB drives and use them as a rotating backup set. What do they cost these days - $100 each? Meh.

Also, if you've set up the machine containing your 1.5TB drive with a timestamping block driver, but you've only ever written 50GB of data to it, then the first time you back it up you'll only end up writing the exact 50GB you've used onto the backup drive. That's kind of the point.

And when I run something to erase data from all of the unused blocks you're going to re-backup that 1.45TB of data just because those blocks were changed?

That seems reasonable to me. If you've actually bothered to overwrite 1.45TB of unused blocks, that's presumably because you don't want whatever was already in them just left lying about; so why would you not also want them scrubbed on your disaster recovery backup?

I agree that if all you want to do is scrub the unused blocks, you'd probably be using random data and it probably would waste time to have the backup routine bother to duplicate it. But you could avoid that by explicitly scrubbing the backup drive as well. If you did a backup before the scrub, you could then run the live-drive scrub and the backup-drive scrub mostly in parallel; provided you began the backup-drive scrub more than say 10T seconds after the live-drive scrub, the backup drive's scrubbed sectors would end up marked new enough to avoid having them copied from the live drive on the next backup run.

In fact, if your data were sensitive enough to call for a big scrub like that, you probably wouldn't do it that way. You'd probably be using an encrypting filesystem (which would work just fine on top of a timestamped block device) from the get-go.

When I edit a 400MB video and add meta information at the beginning causing the last 399MB of the file to be shifted by 20 bytes you're going to have to backup all of those blocks again instead of just backing up a delta of 20 bytes?

Backup software capable of generating that 20 byte delta will typically need to read the entire 400MB file anyway to see what's changed, which is not going to take much less time than just copying 400MB from one block device to another. Backing up any given file would take about as long as saving it in the first place.

benzenedream, sbutler, smcameron, zsazsa: Snapshotting is fine for reaching back in time, but leveraging it to speed up disaster recovery backups is rather more complicated than what I've proposed. Sure, it's possible - these guys do it, among others - but it seems to me that disaster recovery and error recovery are different enough to warrant different approaches.

Error recovery, which is typically what you'd want snapshotting for, is most useful if the snapshots are readily available online. That means you probably want to be making them on one of your live drives, and in fact keeping them inside the same filesystem they are snapshots of is generally the most convenient thing to do.

Disaster recovery requires capturing the entire state of a live drive on a backup drive that can be taken away and kept in a separate building. I can see no particular virtue in having snapshots available on your disaster recovery drive that don't also exist on your live drive.

So, a good disaster recovery facility will give you a copy of everything that was on your live drive at the time you made the backup - filesystems, snapshots, volume managers and Uncle Tom Cobbleigh and all.

If the live drive and the disaster recovery drive are both accessed via a timestamping block driver, then all that stuff should, in theory, Just Work.

If the live drive were in active use while the backup run happened, the backup would indeed end up in an inconsistent state. But provided backup runs are scheduled at times when load on the live drive is light, it should be possible to get a consistent backup simply by running successive passes until the number of blocks that need copying is small, then pausing the source device during the last backup pass (similar to what happens during live migration of a VM).
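
Sketched out, that convergence loop might look something like this (backup_pass, pause and resume being hypothetical hooks onto the driver and the workload):

    def consistent_backup(src, dst, backup_pass, pause, resume,
                          small_enough=1000, max_passes=10):
        """Run repeated passes while the source stays live, then pause it
        briefly for one final pass over the few remaining dirty blocks."""
        for _ in range(max_passes):
            if backup_pass(src, dst) <= small_enough:   # blocks copied this pass
                break
        pause(src)                         # quiesce writers for a moment
        try:
            backup_pass(src, dst)          # final pass against a frozen source
        finally:
            resume(src)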

CuriousZ:

there is no need to time-stamp the block; time-stamp at the file level is fine.

Timestamp per file doesn't work very well for large files containing lots of frequently-updated internal information, such as filesystem images for virtual machines.

Besides the seek time issue mentioned, generally you do not want to write software assuming such specificity of hardware.

There's nothing tremendously hardware-specific about a generic Linux block device, which is what I'd be building my timestamped block device on top of.

Abstraction exists to allow independent development in each layer without causing incompatibility.

That's precisely why a timestamped block device should support the same API as any other block device. It would need an additional ioctl that the backup app could use to access the timestamp information, but as far as any filesystem or LVM or dmraid or mdraid built on top of it was concerned, it would look just like any other block device.
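
Just to show the shape of it, here's how a userspace backup tool might call that extra ioctl; the request number and buffer layout are pure invention for illustration:

    import array, fcntl

    TSBLK_IOC_READ_STAMPS = 0x4B01   # hypothetical request number for the driver

    def read_stamps(dev_path, nblocks):
        """Pull the per-block timestamp table out of the (hypothetical) driver."""
        buf = array.array('I', [0] * nblocks)    # one 32-bit stamp per block
        with open(dev_path, 'rb') as dev:
            fcntl.ioctl(dev.fileno(), TSBLK_IOC_READ_STAMPS, buf, True)
        return buf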

The file system is an abstraction created by the OS to shield application programmers from the nitty-gritty detail of hardware implementation.

Actually, the file system is an abstraction layer that gives application programmers a standard set of methods for remembering what they stored in which block(s) of the underlying block device(s).

A block device is another abstraction layer that allows implementation of filesystems and other facilities like LVM without needing to deal with the differences between assorted kinds of hardware.

You are not a hardware engineer, nor an OS programmer

Them's fightin' words :-)

so you know nothing of the trade-off each has to make to ensure that the hardware performs optimally.

I think my grasp of the fundamentals is adequate.

What if it's an SSD, or a RAID array, or a tape drive? You don't want to rewrite your app for each type of hardware, do you?

No. That's why I'd be interested in seeing this done as a standard Linux block device driver.
posted by flabdablet at 11:14 PM on October 4, 2010


What I'm talking about is a facility that would allow me to back up an entire block device to another, take the backup drive away, reinstall it a week later and run another backup onto it; that second backup would go relatively quickly because the only data copied would be what's changed that week, but I'd still end up with a complete mirror of my live drive.

Forgive me if I'm misunderstanding you, but it sounds to me like you are basically describing point-in-time block replication (vendors mostly call it something like "snapshot replication" or "snap copy"), which is a feature of many enterprise SANs. You can snap a LUN, replicate it to another SAN unit, then later you can snap and copy it again and it will only replicate the changed blocks over, so it's really fast. Differential block copy. I don't think timestamps are the primary mechanism for determining whether a block has changed though, I think most of them use hashing or other fancy magic. Most high-end SAN stuff (EqualLogic, LeftHand) can do this, and some entry-level stuff (HP MSA) can too.

In software, DRBD can also provide block-level replication in this manner. We do exactly what you describe at work every week. We back up our Oracle databases to a storage server at the primary site, then use DRBD to sync the changes over the WAN to the remote site. It's super neat.

If I'm not talking about the same kind of thing you are then I apologize, it's late and my brain is possibly parsing your question wrong.
posted by tracert at 11:27 PM on October 4, 2010


Also I should add that the exact scenario you are describing where you snap copy to a set of disks you update weekly and then keep them in a safe, that you can most definitely do on a Dell MD3000i. I know because I have two. Horrible, awful machines. But they can do that.
posted by tracert at 11:45 PM on October 4, 2010


Response by poster: tracert: yes, you're talking about the same kind of thing. What I think I've dreamed up is a reliable mechanism to allow software to do this reasonably efficiently, using ordinary SATA hard disk drives plugged into my cheapo small-business server instead of an expensive (or possibly horrible and awful) SAN box, and not requiring continuous access to a second server as I believe DRBD does.

The basic problem is working out which blocks need to get copied from source to target to make them the same, even when the target is not, in general, the same one copied to last time (thereby allowing the use of a ping-pong or rotating set of targets).

Seems to me that recording a timestamp for each data block could achieve this, with minimal performance cost and low complexity. I like minimal performance cost and low complexity.

If nobody else actually has done this already, I think I'll roll up my sleeves and start learning about device-mapper.
posted by flabdablet at 12:17 AM on October 5, 2010


What you are proposing is dump, which is the classic Unix backup tool. Yes, it's a good idea (ignore the naysayers), but you were right in your suppositions that it has been done before.

The usual approach is to back up all blocks dated after some particular time and to take a multi-level approach to it, with occasional full backups followed by progressively finer differential backups.

NB the presence of the dump-level column in /etc/fstab for exactly this purpose.
posted by polyglot at 4:53 AM on October 5, 2010


Response by poster: I don't think it's the same thing. The dump manpage says it figures out what needs dumping by examining file modification times. Also, dump's output is a file in dump archive format suitable only as input to restore, rather than a block-for-block identical mirror of the original device.

The nice thing about backing up a timestamped block device would be the ability to generate exactly such a mirror - functionally equivalent to a level 0 dump, apart from not requiring a restore before being usable - while reading only (device capacity / 1024 + number modified) blocks and writing only (number modified * 1.001), while needing to pay no attention at all to filesystem details.
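
Back-of-the-envelope, for a 1.5TB device with (say) 5GB changed since the last backup pass, all figures illustrative:

    total_blocks = 1_500_000_000_000 // 4096    # ~366 million 4K blocks
    modified     =     5_000_000_000 // 4096    # ~1.2 million changed blocks

    reads  = total_blocks // 1024 + modified    # timestamp blocks + changed data
    writes = int(modified * 1.001)              # changed data + ~0.1% stamp updates

    print(f"read  ~{reads  * 4096 / 1e9:.1f} GB")   # about 6.5 GB, not 1.5 TB
    print(f"write ~{writes * 4096 / 1e9:.1f} GB")   # about 5.0 GB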
posted by flabdablet at 5:33 AM on October 5, 2010


Years ago I used a version of rsync that did block level copying and the performance improvement was dramatic. I don't believe it used checksums to compare blocks and the cost of examining every block was worth it when only parts of files were changed. This is a different scenario, but similar enough that I thought I would mention it.
posted by dgran at 9:00 AM on October 5, 2010


What you're talking about almost certainly exists somewhere between WAFL, ZFS, and the various utilities that interact with one or the other. Last I checked NetApp and SunOracle were suing the fuck out of one another for violation of the kazillion patents that they each have related directly to things you're talking about. There should be enough references to the patents in question (and related prior art) in that lawsuit summary to lead you to the patent(s) you'd be looking to avoid/violate... :)
posted by togdon at 10:47 AM on October 5, 2010


Also, if you want to contribute to a project rather than go your own way, check out BTRFS, the Linux clone of ZFS. It's being included in the Linux kernel as a development filesystem, and has block-level snapshot features. Seems like it has a way to go from the wiki: "Note that Btrfs does not yet have a fsck tool that can fix errors."
posted by benzenedream at 12:55 PM on October 5, 2010


Best answer: This kinda sounds like SVN blame, without the usernames and on a lower level. If I were you, I'd ditch clocks, because one bad CDT bug or UTC misconfiguration could screw over a critical backup, and replace it with logical clocks, aka version numbers. At which point you're really getting into Versioning Filesystems, of which there are plenty for you to study, including (according to wikipedia) ZFS and Clearcase.
posted by pwnguin at 6:26 PM on October 5, 2010


Response by poster: pwnguin: You're right - version numbers work even better than timestamps. They could also be smaller than timestamps, since they only need to change once per backup pass instead of once per second; 16 bits should be plenty, which reduces the housekeeping overhead to one 4K block in 2049.

This isn't intended to support filesystem versioning. I don't envisage any dynamic block address remapping; not trying to reinvent snapshots or other copy-on-write things, since LVM and assorted filesystems already have that covered. All I want is a simple, robust, low-overhead way to keep track of what blocks are new on the live device wrt an offline mirror backup, so that the backup can be brought up to date without performing too much unnecessary I/O.

Schemes based on copy-on-write snapshotting necessarily involve a tradeoff between usable storage space and snapshot lifetime; the simple scheme I'm thinking about imposes a very small, fixed space and speed overhead and should work without needing to be thought about much.
posted by flabdablet at 8:37 PM on October 5, 2010


Look at 'zfs send' and 'zfs receive'.
posted by PueExMachina at 9:21 PM on October 5, 2010


Also, look into this feature of btrfs.
posted by PueExMachina at 9:24 PM on October 5, 2010


yes, zfs and btrfs solve this. or if you really want it on a block level, look at SRDF which is EMC's SAN solution. I'm sure other SAN providers do something similar.
posted by jrishel at 5:55 AM on October 6, 2010


the simple scheme I'm thinking about imposes a very small, fixed space and speed overhead and should work without needing to be thought about much.

But thinking is much cheaper than implementing it, and you've asked us to think!

Let me propose one more simplification, and see how it strikes you. Instead of revision numbers, have a "dirty" bitmap like the free block map, that represents whether a block has been written to since last backup. It's even more micro-optimized, and captures what you really care about.

As best as I can tell, this is just a very sloppy copy (backup) on write. Thoughts?
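
In toy form, that's about as little state as change tracking can carry (a sketch only: one bit per block, set on every write, cleared once a backup completes):

    class DirtyBitmap:
        """One bit per block: set on write, cleared after a successful backup."""

        def __init__(self, nblocks):
            self.bits = bytearray((nblocks + 7) // 8)

        def mark(self, block):                   # driver calls this on each write
            self.bits[block >> 3] |= 1 << (block & 7)

        def is_dirty(self, block):               # backup tool checks this
            return bool(self.bits[block >> 3] & (1 << (block & 7)))

        def clear(self):                         # after the backup completes
            self.bits = bytearray(len(self.bits))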
posted by pwnguin at 4:10 PM on October 6, 2010


Response by poster: Instead of revision numbers, have a "dirty" bitmap like the free block map, that represents whether a block has been written to since last backup.

That only works if there's a 1:1 relationship between live and backup devices. I'm looking for a way to support the use of a rotating set of backup drives, so that there's no risk at all of violent electric death taking out both the live device and its only backup during a backup session. It would also require the backup process to write to the source device in the course of backing it up, which I'm not happy with; I'd rather avoid deliberately building race conditions into the design.

Timestamps are nice because they imply nothing about relationships between the devices that bear them; they're just information that can be used once such a relationship has already been established by e.g. checking a backup-set UUID.

Using backup-pass counters instead of timestamps is somewhere in between.

I've gone off the idea of using 16-bit counters, by the way; these only allow "older than" to be reliably distinguished from "newer than" if the numbers involved are either never allowed to grow beyond 65535 or never allowed to differ by more than 32767. If we assume one backup session per day, over a device life of 5 years, that means we'd be building in failure if it could take more than a few tens of passes to complete a backup session. That's too close to the wind for me.
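
To illustrate the trap: with modulo-65536 "serial number" comparison, the ordering silently flips once two counters drift more than 32767 apart.

    def older_than(a, b, bits=16):
        """Treat a as older than b under modulo-2**bits comparison."""
        return 0 < (b - a) % (1 << bits) < (1 << (bits - 1))

    assert older_than(100, 200)          # clearly older
    assert older_than(65_530, 5)         # survives a single wrap
    assert not older_than(0, 40_000)     # drifted >32767 apart: ordering flips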
posted by flabdablet at 6:57 PM on October 6, 2010


Response by poster: Oh, and just to be clear: I am absolutely in favour of thinking carefully during the design stage! I just don't want to design something that the operator will fatally screw up if deprived of sufficient caffeine.
posted by flabdablet at 7:58 PM on October 6, 2010


Response by poster: PueExMachina, jrishel: I'm not yet really convinced that btrfs does actually have this covered well, for the use case I'm considering (efficient backup of a device/filesystem containing an assortment of other filesystems nested inside virtual disk images).

Given
Every btree block and file extent include the transaction id of when they were created. When COW is on, this means they include the transaction id of when they were last modified.

Finding updated file extents means searching through the tree based on transaction id (ignoring any branch in the tree older than transid X), which is exactly what the treelog code does to efficiently log fsyncs.
it's clear that btrfs does log enough information to limit backup traffic to updated extents, provided it's mounted with COW turned on.

But it seems to me that turning on COW would result in a pretty severe performance hit for a btrfs containing mainly VM disk image files: just about every disk write performed by the VM would fragment its image file, breaking up the extent containing the block it just wrote. If I were using a btrfs to contain VM disk images, I'd certainly be mounting it -o nodatacow, which it seems to me would screw up any backup optimization based on btrfs extent transaction IDs.

Even with COW turned on, figuring out which new extents on the live device correspond to parts of old extents already on the backup device and therefore don't need to be copied would be a fairly fiddly business, I would think.

If I'm wrong about any of the above, I'd love to be pointed to documentation that explains why.
posted by flabdablet at 8:52 PM on October 6, 2010


Response by poster: pwnguin: I don't see this so much as COW, sloppy or otherwise. What I'm after is something that can do pretty much what rsync already does, but can work out what needs updating without full-length data reads at both source and destination.
posted by flabdablet at 9:01 PM on October 6, 2010


flabdablet: "Timestamps are nice because they imply nothing about relationships between the devices that bear them; they're just information that can be used once such a relationship has already been established by e.g. checking a backup-set UUID."

Actually they do. You're relying on some universal clock service to infer which blocks need not be updated. I still think trying to use timestamps is going to be a failure. Anyways, you've exhausted my knowledge of filesystem implementations. If you don't get a satisfactory proof of incorrectness out of mefi, you might try lkml, and if you still press on, I look forward to reading your USENIX publication :)
posted by pwnguin at 9:23 PM on October 6, 2010


Response by poster: Actually they do. You're relying on some universal clock service to infer which blocks need not be updated.

Granted. My point was that what gets written to the timestamps is independent of what also happens to have been written to any other device. There's no relationship between the tracking information on one device and that on another beyond the fact that both were sourced from a global clock.

And sure, clocks can be wrong; but ntp makes that a solved problem on most systems, and the clock would have to go backwards by more than the time between successive backup passes before blocks that should be copied started being missed.

Using counters instead of timestamps would require that the counter values written to the backup device match those read from the source device for any given block, which means that there is now a causal relationship between tracking info written to the backup and that on the source device, which means I'd need to implement an ioctl to set the current tracking number in addition to the one the timestamp-based version would need for reading them. That's not necessarily a bad thing, but it is an extra thing.

So it seems I now have a bit of a learning curve to climb. First thing is working through the process of compiling and installing a kernel. Hey ho, let's go.
posted by flabdablet at 10:06 PM on October 6, 2010


Response by poster: And... there is a fundamental flaw. Thanks, pwnguin.

Destination blocks would be timestamped with their actual time of writing rather than a copy of the source block timestamp.

This could cause breakage after a backup device was pressed into service as a live one. So it would be better to copy the timestamps during a backup pass, rather than regenerating them; which means I need my extra ioctl anyway; which means I might as well use counters instead.

Sounds like a plan.
posted by flabdablet at 10:15 PM on October 6, 2010


This thread is closed to new comments.