In a journaling file system, why are journal writes more reliable than file system writes?
April 10, 2007 3:56 PM   Subscribe

In a journaling file system like ext3, a journal entry is written before each file system change, describing the change about to be carried out. This allows quick recovery if the actual file change is interrupted or not carried out due to power outage or whatever. But why is the act of writing the journal entry not susceptible to the exact same threat of being interrupted?

I've read several descriptions of journaling file systems, but I can't find an explanation as to why you're not just pushing the same interruption/inconsistency problem further up the chain.

I can guess some possible answers: (a) the journal entry is physically smaller or takes less time to write than the actual file change, thus decreasing the chance of something going wrong while it's happening; (b) a single journal entry write can represent the two or more file system writes necessary to keep the system consistent; (c) the file system driver somehow 'prioritises' writes to the journal over normal file system writes; (d) the journal operates on some kind of transaction basis so that partially-written entries can be recognised as such and ignored.

These are all nice theories out of my head as to why journal writes are more reliable and more atomic than the actual file system writes. But I can't confirm these hunches anywhere. So what are the actual reasons (in, say, ext3, if a concrete example is needed)?
posted by chrismear to Computers & Internet (10 answers total) 1 user marked this as a favorite
I think the idea is:

1) you find a new location on the disk to write your new data
(if a failure happens here, you lose the new data)

2) You save the old location in the journal
(If a failure happens here, you lose the new data)

3) You update the actual file allocation table
(If a failure happens here, you can read the journal and recover the old data; you lose only the new data)

4) You update the journal to indicate that the data is written fully
(if a failure happens here, do the same thing as step three)

That's how I would implement a journaling file system, anyway.
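A rough Python sketch of those four steps (the dicts and names here are purely illustrative — this is nothing like ext3's real on-disk layout):

```python
# Toy model of the four steps above. Dicts and lists stand in for
# on-disk structures; "locations" are just integer keys.

def write_with_journal(disk, journal, alloc_table, filename, new_data):
    # 1) write the new data to a fresh location on "disk"
    new_loc = max(disk.keys(), default=0) + 1
    disk[new_loc] = new_data

    # 2) record the old location in the journal before touching the table
    journal.append({"file": filename,
                    "old_loc": alloc_table.get(filename),
                    "new_loc": new_loc,
                    "committed": False})

    # 3) update the allocation table to point at the new data
    alloc_table[filename] = new_loc

    # 4) mark the journal entry as fully written
    journal[-1]["committed"] = True

def recover(journal, alloc_table):
    # After a crash, roll back any table update whose journal entry
    # never got its "committed" mark.
    for entry in journal:
        if not entry["committed"] and entry["old_loc"] is not None:
            alloc_table[entry["file"]] = entry["old_loc"]
```

A crash before step 2 loses only the new data; a crash between steps 3 and 4 is undone by `recover`, which puts the table back to the old, consistent location.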
posted by delmoi at 4:05 PM on April 10, 2007

I think you're right about transactional integrity. (your (b))

Also, data can be lost after a system crash, but with journaling switched on, you can easily compare the journal (from the last good entry) with the filesystem and quickly fix any errors.

I guess that because a journal doesn't have a random-write element and it only contains one file, it has a simpler structure and is less likely to suffer corruption on a system crash. (i.e. there's no File Allocation Table and no chance that the data before the last marked-good transaction is incorrect)
posted by seanyboy at 4:24 PM on April 10, 2007

Link for my second point
posted by seanyboy at 4:27 PM on April 10, 2007

Writing the journal is definitely susceptible to the same issues as writing data to the disk. In fact from the HD's point of view it's just data. Journaling doesn't protect your (new) data, it just prevents inconsistencies.

I only remember this vaguely from my DB course in uni, but the basic premise is as follows (and there are a whole bunch of various strategies). You write to the journal before committing to disk. Once the data has been safely confirmed as being on disk we erase the record in the journal.

If a failure occurs while we're writing to the journal, the original data is still in a consistent state (basically unchanged) so we pretend like everything's fine (except that the new, pre-crash data that we were writing is lost).

Summary: journaling moves the point of failure elsewhere, which prevents crashes from messing up your filesystem.

I'm not a filesystem designer, but googling should get you more information on implementation particulars. I guess you could also take a look at kernel source, if ext3 or ReiserFS or something like that interest you.
posted by aeighty at 4:46 PM on April 10, 2007

The whole point of the journalling is to always leave the file system in a consistent state. So, yes, you might lose the update, but the file system will always be in a totally consistent state.
posted by gadha at 4:46 PM on April 10, 2007

Yeah, journaling filesystems are about keeping the _filesystem_ (that is, the database of where your data is on disk) consistent, not necessarily the data itself.

The optimal situation is this:
1 - You perform a write to the filesystem. Blocks are changed, block allocations change, whatever. The filesystem changes are written to the journal.
2 - Once it's safely in the journal, it's committed to the filesystem.

If power is lost while the changes are being written to the journal, then no big deal. It'll never try to replay partial journal entries.

If power is lost while it's committing changes from the journal to the filesystem, then still no big deal, because it'll just replay that journal entry again the next time the filesystem is mounted.
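A minimal write-ahead sketch of those two phases in Python (illustrative only, not the real ext3 commit protocol):

```python
# Phase 1: the change goes to the journal first. Phase 2: it's
# committed to the filesystem proper. Replay on mount makes a crash
# in either phase harmless to consistency.

def journaled_write(fs, journal, key, value):
    # Phase 1: append the intended change to the journal...
    entry = {"key": key, "value": value, "complete": False}
    journal.append(entry)
    entry["complete"] = True      # ...and mark the record fully written
    # Phase 2: commit the change to the filesystem itself
    fs[key] = value

def mount(fs, journal):
    # Replay only complete journal entries; partial entries (a crash
    # mid-phase-1) are simply ignored, so fs stays consistent.
    for entry in journal:
        if entry["complete"]:
            fs[entry["key"]] = entry["value"]
```

If the crash hit during phase 1, the incomplete entry is skipped and the filesystem is untouched; if it hit during phase 2, the complete entry just gets replayed.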
posted by Laen at 4:50 PM on April 10, 2007

I think the answer (from the possibilities outlined in your original post) is (d).

The key is transactional integrity. Not really data protection. If the power goes off before the data gets written from the disk's buffer to the platters, it's gone. This is seen as pretty much normal. (You want to protect your data from that, you need a UPS.)

What a journaling file system really protects against, are situations where the power is cut off in the middle of a write, and the filesystem gets left in some sort of unstable, where-the-hell-were-we state. By going through a process (journal: okay, here's what we're going to do; filesystem: actually do it; journal: okay, we did it) if the power does click off, you can go back and redo any half-completed operations, purge the filesystem of any half-written crap, and bring it back to a stable state.

The standard example is of deleting a file. Deleting a file is a two-step process: first you delete the file from the filesystem's structure, and then you mark the space that the file is taking up on disk as available so it can be reused. On a non-journaled system, if the plug gets pulled after the first step but before the second, then the file disappears but the space never gets reused. Over time, your drive shrinks. (There are worse things that can happen during other operations, including data corruption, but the delete example is easiest to visualize.) On a journaled system, a note is first made in the journal about the file to delete, then the two-step delete process is executed, and then a note is made in the journal saying that it went through okay. If, as before, the power goes out during the middle of the delete process, the journal can be scanned and the half-finished delete immediately recognized as an incomplete operation. The operation is retried, and the filesystem is happy -- nothing left half-done.
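That journaled delete can be sketched in a few lines of Python (a toy model with made-up names, not any real filesystem's code):

```python
# Toy model of the journaled delete described above. `files` maps
# filename -> size; `free_space` collects reusable (name, size) pairs.

def journaled_delete(files, free_space, journal, name):
    # note in the journal first: here's what we're about to do
    journal.append({"op": "delete", "name": name, "done": False})
    size = files.pop(name)              # step 1: remove directory entry
    free_space.append((name, size))     # step 2: mark the space reusable
    journal[-1]["done"] = True          # note that it went through okay

def replay(files, free_space, journal):
    # On remount, retry any delete the journal shows as incomplete.
    # (Toy simplification: assumes the crash happened before step 1.)
    for entry in journal:
        if entry["op"] == "delete" and not entry["done"]:
            if entry["name"] in files:
                size = files.pop(entry["name"])
                free_space.append((entry["name"], size))
            entry["done"] = True
```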

Of course it all comes at a cost: you're essentially doubling the number of write operations in some cases (although not always doubling), but that's an easy tradeoff on systems where stability is more important than raw throughput.
posted by Kadin2048 at 9:18 PM on April 10, 2007

One interesting alternative to journalling that you might want to look into is the Write-Anywhere File Layout (WAFL) designed by Network Appliance. Here's the surprisingly intelligible patent.
posted by flabdablet at 9:58 PM on April 10, 2007

From Wikipedia's entry for ZFS:

ZFS uses a copy-on-write, transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required.

This sounds to me like journaling (guaranteed transactional integrity) for the data blocks, with file system metadata used for managing those blocks presumably being journaled like traditional journaling file systems. Thus transactional integrity for your file system and the data in your files. Yay.
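The copy-on-write idea in that quote can be sketched like this (a toy model with invented names, nothing like ZFS's real on-disk format):

```python
# Data blocks are never overwritten in place: a new block is written
# first, and the pointer (with its checksum) flips only afterwards.
import hashlib

def cow_write(blocks, pointers, name, data):
    checksum = hashlib.sha256(data).hexdigest()
    addr = len(blocks)                   # allocate a fresh block
    blocks.append(data)                  # old block is left untouched
    pointers[name] = (addr, checksum)    # pointer + checksum flip last

def cow_read(blocks, pointers, name):
    addr, checksum = pointers[name]
    data = blocks[addr]
    # verify the block against the checksum stored in the pointer
    assert hashlib.sha256(data).hexdigest() == checksum, "corrupt block"
    return data
```

A crash before the pointer flip leaves the old version fully intact, which is how copy-on-write gets transactional behaviour without replaying a journal.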

IANAFileSystemGeek, though.
posted by BaxterG4 at 10:58 PM on April 10, 2007

As others have pointed out, it's all about consistency. The important point is that the journal is written linearly and it's possible to determine where the valid entries in the journal stop. Transactions (they're not really transactions in the ACID sense at all though) are designed in such a way that you can truncate the journal at ANY point due to power failure and the filesystem will be in a valid state once the journal has been replayed.

So entries in the journal might be something like "delete this directory entry", "move this file over there", "extend this file using free blocks from that group", etc. This is in contrast to a traditional filesystem where any one of those logical operations requires writes to multiple places on disc, hence the possibility of inconsistency.

Consider the case of deleting a file: you must remove the directory entry, decrement the link count on the inode and, if it reaches zero, add all the file's blocks to the free list. If the drive stops writing for lack of power at any point in that three-step process, you will have an inconsistent filesystem; the way in which it is inconsistent will often be completely arbitrary, since discs reserve the right to re-order accesses as they see fit.

So you might end up with blocks that are allocated to a file yet also in the free list and get allocated to another file: bad because the files share their contents. Or you might have lost blocks that are in no files and not available as free. Or you might have a lost inode with no directories pointing to it. Or you might have an inode with a counter too low which will result in the file being freed when it still exists. Most of those cases are pretty catastrophic.

With the journalling filesystem, there's a single journal entry written which contains all the information describing the change; the entry is smaller than a block and (in most systems?) has a checksum so you know it's valid. That makes the writing of a journal entry atomic, which is what is required for consistency.
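Finding where the valid entries in a linear journal stop might look something like this (using `zlib.crc32` as the checksum; real filesystems differ in the details):

```python
# Sketch of truncating a linear journal at the first corrupt entry.
# A torn write from a power failure fails its checksum, so everything
# from that point on is ignored during replay.
import json
import zlib

def make_entry(op):
    payload = json.dumps(op).encode()
    return {"payload": payload, "crc": zlib.crc32(payload)}

def valid_prefix(journal):
    # Walk the journal linearly; stop at the first entry whose
    # checksum doesn't match, and replay only what came before it.
    good = []
    for entry in journal:
        if zlib.crc32(entry["payload"]) != entry["crc"]:
            break
        good.append(json.loads(entry["payload"]))
    return good
```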

When the system crashes and restarts, the recovery process inspects the journal and the filesystem state to see where modifications to the filesystem metadata (directories, inodes, freelists, etc) got to with respect to the journal. It can figure out which changes were made and which weren't; the process of performing the changes that haven't happened yet is called replaying the journal.

Most transactional filesystems give journalling for the metadata only, ensuring consistency but not preventing data loss. If you look at a traditional database supporting ACID transactions, the log (journal) contains the content of the changes too (required for rollback if it's executing optimistically and finds an isolation violation on commit)... a filesystem can do the same, but the space and time costs are pretty bad.
posted by polyglot at 4:34 AM on April 11, 2007 [1 favorite]
