Efficiently retrieving a file from an archive
August 27, 2009 4:25 PM

Is there a smarter tar or cpio out there, or a smarter way to archive, to efficiently retrieve a file stored in the archive?

I am using tar to archive a group of very large (multi-GB) bz2 files.

If I use tar -tf file.tar to list the files within the archive, this takes a very long time to complete (~10-15 minutes).

Likewise, cpio -t < file.cpio takes just as long to complete, plus or minus a few seconds.

Accordingly, retrieving a file from the archive (via tar -xf file.tar myFileOfInterest, for example) is just as slow.

Is there an archival method out there that keeps a readily available "catalog" with the archive, so that an individual file within the archive can be retrieved quickly?

For example, some kind of catalog that stores a pointer to a particular byte in the archive, as well as the size of the file to be retrieved (as well as any other filesystem-specific particulars).

Is there a tool (or argument to tar or cpio) that allows efficient retrieval of a file within the archive?
posted by Blazecock Pileon to Computers & Internet (20 answers total) 2 users marked this as a favorite
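For what it's worth, the byte-offset catalog the question describes can be bolted onto an ordinary uncompressed tar: Python's tarfile module exposes each member's data offset and size, so a one-time scan can build a sidecar index, after which each retrieval is a single seek. A minimal sketch, with hypothetical file and function names:

```python
import json
import tarfile

def build_catalog(tar_path, catalog_path):
    """One-time scan: record each member's data offset and size."""
    catalog = {}
    with tarfile.open(tar_path, "r:") as tf:   # "r:" = plain, uncompressed tar
        for member in tf:
            if member.isfile():
                catalog[member.name] = (member.offset_data, member.size)
    with open(catalog_path, "w") as f:
        json.dump(catalog, f)

def fetch(tar_path, catalog_path, name):
    """Retrieve one member by seeking straight to its bytes, with no scan."""
    with open(catalog_path) as f:
        offset, size = json.load(f)[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

The initial build_catalog pass costs the same ~10-15 minutes as tar -tf, but only once; every fetch after that is one seek plus one read. Note this only works on an uncompressed tar, which is the case here since the members are already bz2.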
 
There is a newer utility called dar which is sort of a modernization of tar, aimed at disk archiving instead of tape archiving, which might be faster. It's not included in most *nix distributions by default, but I've had good luck compiling it (or maybe I just found precompiled binaries) on both Linux and Mac OS X.

I don't know for a fact that it's any faster to retrieve files than tar, but the archive format is different and is aimed at modern storage devices rather than linear tape, so it wouldn't surprise me if it was. It normally compresses files by default (unlike tar) but it has a commandline switch to exclude certain file extensions like .gz or .bz2.

If I get a chance tonight I'll run some tests and see if dar is faster to retrieve files from an archive.
posted by Kadin2048 at 5:05 PM on August 27, 2009


I guess my question is, why do you need to tar this data at all?

File transfers are only negligibly more difficult with directories (scp vs scp -r). You've already bzipped the data, so disk space isn't the issue. A good directory structure will allow you to find your data and access it easily, without any of the overhead of tar.

I'm interested from an academic standpoint to know if there is an answer, but unsure why you need it, I suppose.
posted by chrisamiller at 5:14 PM on August 27, 2009


Correction to the above, per the dar online manual:
dar can use compression. By default no compression is used.
So you don't need to do anything to exclude your bz2 files unless you explicitly turn on compression. I thought for some reason compression defaulted to on, but I was wrong. I think because I just always use it that way in my backup scripts. Anyway...
even using compression dar has not to read the whole backup to extract one file. This way if you just want to restore one file from a huge backup, the process will be much faster than using tar. Dar first reads the catalogue (i.e. the contents of the backup), then it goes directly to the location of the saved file(s) you want to restore and proceed to restoration. [...]
(Emph. mine)

So it sounds like it will do exactly what you want.
posted by Kadin2048 at 5:14 PM on August 27, 2009 [1 favorite]


Have you tried ZIP? The format has a directory so it should do what you want rather than the concatenated headers+files of tar and cpio.
posted by grouse at 5:18 PM on August 27, 2009


I'm interested from an academic standpoint to know if there is an answer, but unsure why you need it, I suppose.

Mainly for scalability and consistency. We're grabbing datasets (e.g. scores) from tables on UCSC. Data are separated by units of chromosomes. Presently we tar the compressed versions of these files.

Using tar allows us to store whole tables in one file, which gives all of us one point of reference for the archived data. We can retrieve and work with one chromosome's worth of data as needed, without touching the rest.
posted by Blazecock Pileon at 5:28 PM on August 27, 2009


Pretty sure both rar and zip do this (i.e. keep an index of the archive's contents).
posted by Rhomboid at 6:04 PM on August 27, 2009


A quick test, using the dar source tarball:

First, a single-file restore using tar:
$ time tar -xvf dar.tar dar-2.3.9/NEWS
dar-2.3.9/NEWS

real    0m0.101s
user    0m0.014s
sys     0m0.080s
Compared to dar:
$ time dar -O -x dar -g dar-2.3.9/NEWS
[output removed]

real    0m0.122s
user    0m0.075s
sys     0m0.037s
Then I tried it using a single file from a 315MB dataset consisting mostly of small files; again here's tar:
real    0m0.481s
user    0m0.056s
sys     0m0.411s
And then dar:
real    0m0.101s
user    0m0.058s
sys     0m0.036s
So as you increase the size of the archive, the difference between tar and dar starts to become more and more apparent. I suspect if you tried it with some multigigabyte files, you'd see a serious time savings.

Dar does have some quirks that make it not quite a drop-in replacement for tar; I find myself having to consult the manual every time I go to use it, but it has enough options that I've never really run into a situation where it couldn't do something that I wanted it to. (Its real advantage versus tar is in spanning archives across multiple disks/files, like if you were burning several GB of data to CD-R. If you do this, it's smart enough to only ask you for the 'slice' that contains the file you want to recover.)
posted by Kadin2048 at 6:29 PM on August 27, 2009


I'd like to test dar. Is it possible to extract a file to standard output? It seems possible to pipe the archive to standard output, but I can't see what options to use to extract a file to stdout.

BTW, I also tested 7z, which looks very promising.
posted by Blazecock Pileon at 6:37 PM on August 27, 2009


You can create a filesystem in a file, mount it via the loopback device, and use that as an archive. The process is the same as making an image to burn to CD, only you can dispense with burning to different media if you want. This is how most people package things for MacOS these days: as disk images that, when "opened," appear as a file hierarchy under /Volumes. I guess under Linux you would use some flavor of mkfs.
posted by fantabulous timewaster at 8:01 PM on August 27, 2009


I'm going to suggest that you consider using git and maybe ZFS. I know, I know, that's not quite what you asked, but I think it makes sense for large datasets. Here's why. Git gives a full history. Unless everyone is very disciplined about naming their archive files and making certain that every necessary bit is included, your archive files are going to end up being a mess. If, and I know it's a big if, you have large data files whose contents change incrementally, git will attempt delta compression between revisions, which will save disk space. I'm assuming, because disk space is cheap nowadays, that you're going to keep all the data on disk and online and you're just looking for a way to organize it. If that's the case, putting everything in a git repository makes distribution much easier.

I've used git for source control, not for storing huge datasets so there might be hidden traps.

As soon as ZFS gets data deduplication, I think it, its cousins, and the fact that disks are about as cheap as traditional backup media are going to change the way we think about backups. ZFS already has snapshots, copy-on-write, compression, and the ability to just keep adding disks.

I've been thinking about this a lot for the past week so I may be answering my own questions rather the question you asked but I thought I'd throw this out there.
posted by rdr at 8:16 PM on August 27, 2009


I'd also consider compressing each chromosome (or unit of chromosomes) separately.
posted by rhizome at 9:11 PM on August 27, 2009


Blazecock Pileon: "I'd like to test dar. Is it possible to extract a file to standard output? It seems possible to pipe the archive to standard output, but I can't see what options to use to extract a file to stdout."

Humm...good question. I've never used it that way. It doesn't look like it does, at least from my reading of the manual. But it has some fairly complex features that I've never touched, involving dar_slave and client/server mode, that might be able to do it in some less-than-obvious way. But in general it looks pretty much built for file-oriented operation rather than stream-oriented.
posted by Kadin2048 at 9:38 PM on August 27, 2009


I'd also consider compressing each chromosome (or unit of chromosomes) separately.

To reiterate, the archive is made up of bz2 (bzip2) files.
posted by Blazecock Pileon at 9:58 PM on August 27, 2009


But in general it looks pretty much built for file-oriented operation rather than stream-oriented.

If that's the case, that's not going to be as helpful as other tools (we like to pipe stuff between commands as much as possible). But it might be useful for other work, so thanks for pointing it out!
posted by Blazecock Pileon at 10:00 PM on August 27, 2009


Two people said this already, but I'll concur. Try zip -0 and unzip. zip -0 (that's a zero) will turn off compression to avoid wasting time on it, given that your contents are already well compressed. You can unzip a single file to stdout with unzip -p file.zip filetoextract. The format keeps a catalog at the end of the file, so it should be efficient to get files out of large archives, hopefully with only a few seeks. That's not to say there aren't implementation bugs.
posted by sergent at 10:32 PM on August 27, 2009
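If scripting this is easier than shelling out, the zip -0 / unzip -p recipe above maps directly onto Python's standard zipfile module: ZIP_STORED is the no-compression mode, and reading one member touches only the central directory at the end of the file plus that member's bytes. A sketch, with made-up function names:

```python
import os
import zipfile

def pack(zip_path, paths):
    """Like `zip -0`: store already-bzipped files without recompressing."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in paths:
            zf.write(path, arcname=os.path.basename(path))

def extract_one(zip_path, member):
    """Like `unzip -p`: pull a single member out via the central directory."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member)
```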


rdiff-backup?

Really it sounds like you should just be using a directory structure and some tools to enforce it - rather than relying on tar to archive things.

If you are always pulling individual items out of the archives - they probably shouldn't be tarred up in the first place.

Version control (svn, git, etc) or just a plain directory structure with some supporting scripts to index, checksum, and so on would probably make more sense.

Or a database.
posted by TravellingDen at 11:04 PM on August 27, 2009


We already have a naming and organizational scheme for these files (which is tied closely to how our lab data browser operates) as well as packaging/unpackaging tools which are used to work with these bundles.

Adding version control and changing filesystems or naming schemes wouldn't solve the root of this specific problem and would likely introduce several new and larger headaches. I'll be honest and say that these three approaches are probably non-starters.

A database is great for random access and we use this for visualizing data, but for storage and performance reasons, lossy compression is used for some of the data put into the database. To get to the true data values we need to handle packaging of files that other institutions have available and we need to be able to use reasonably standard and/or open-source UNIX tools and procedures to do this, which motivates my question. (Additionally, filesystem access lets us reduce load on our already overburdened database.)

It sounds like an index-capable archival tool like 7z or zip may help solve this issue. Thanks to all for your advice!
posted by Blazecock Pileon at 12:12 AM on August 28, 2009


When I dealt with this sort of thing, I built a filter that split the input and independently bzipped each piece. It would produce files like largearchive.tar.aaa.bz2. Block headers let tar pick up in the middle. I'd split the files at CD size (700 MB) and also have the filter tee into a "tar tvf" to keep a catalog. Restoring was a matter of piping the relevant files through bzip2 and then tar.
posted by Pronoiac at 1:29 AM on August 28, 2009
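Pronoiac's split-and-catalog filter can be approximated in a few lines. This sketch (the names and slice size are placeholders) slices a stream into fixed-size pieces, bzips each piece independently, and keeps an ordered catalog so restoring is just decompress-and-concatenate:

```python
import bz2
import json

CHUNK = 700 * 1024 * 1024   # CD-sized slices, as in the comment above

def split_compress(src, prefix, chunk_size=CHUNK):
    """Slice src into fixed-size pieces, bzip2 each piece independently,
    and write an ordered catalog so the pieces can be reassembled."""
    slices = []
    with open(src, "rb") as f:
        i = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            name = "%s.%03d.bz2" % (prefix, i)
            with open(name, "wb") as out:
                out.write(bz2.compress(chunk))
            slices.append(name)
            i += 1
    with open(prefix + ".catalog", "w") as f:
        json.dump(slices, f)

def restore(prefix, dest):
    """Decompress the slices in catalog order and concatenate them."""
    with open(prefix + ".catalog") as f:
        slices = json.load(f)
    with open(dest, "wb") as out:
        for name in slices:
            with open(name, "rb") as f:
                out.write(bz2.decompress(f.read()))
```

Because each slice is a complete bzip2 stream, any one of them can be decompressed on its own, which is what makes partial restores possible.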


The root of the problem is that the tar format does not maintain an index of the files in the archive; extracting "myFileOfInterest" means seeking through the archive one file header at a time looking for the right file name. zip or 7zip archives have indexes, and will work much better. (FWIW, if you ditch the bz2 on the individual files and store them in a 7zip archive, there's a good chance you'll make some massive space savings; 7zip really shines when redundancy in data spans multiple files in the archive, which I suspect may be the case for you. And you'd get the indexing win.)

Using tar allows us to store whole tables in one file, which gives all of us one point of reference for the archived data. We can retrieve and work with one chromosome's worth of data as needed, without touching the rest.

How does one directory per table not provide this functionality? Why the need to wrap all the files into one big file? I'll assume you've got your reasons, but it's not clear what they are.
posted by buxtonbluecat at 5:40 AM on August 28, 2009
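The missing-index point is easy to check empirically: wrap the archive in a file object that counts bytes read, then compare how much data tar versus zip has to consume just to list the member names. A rough sketch:

```python
import io
import tarfile
import zipfile

class CountingFile(io.FileIO):
    """File object that counts how many bytes are actually read."""
    def __init__(self, path):
        super().__init__(path, "r")
        self.bytes_read = 0
    def read(self, n=-1):
        data = super().read(n)
        self.bytes_read += len(data)
        return data

def bytes_to_list_tar(path):
    f = CountingFile(path)
    with tarfile.open(fileobj=f, mode="r:") as tf:
        tf.getnames()          # walks every 512-byte header in the file
    return f.bytes_read

def bytes_to_list_zip(path):
    f = CountingFile(path)
    with zipfile.ZipFile(f) as zf:
        zf.namelist()          # reads only the central directory at the end
    return f.bytes_read
```

For the tar, listing names reads every header block (and seeks across all the member data in between); for the zip, it reads only the small directory at the tail of the file. On a multi-gigabyte tar, that header walk across the whole file is where the 10-15 minutes goes.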


The Internet Archive has a similar use case to yours. They developed a format they call ARC. See: ARC file format. They explicitly separate the index of files in the archive from the archive itself.

(Note that there are two archive formats with the extension .arc)
posted by bdc34 at 8:08 AM on August 28, 2009


This thread is closed to new comments.