Efficiently retrieving a file from an archive
August 27, 2009 4:25 PM Subscribe
Is there a tool (or argument to tar or cpio), or a smarter way to archive, that efficiently retrieves a file stored in an archive?

I am using tar to archive a group of very large (multi-GB) bz2 files.

If I use tar -tf file.tar to list the files within the archive, it takes a very long time to complete (~10-15 minutes). Likewise, cpio -t < file.cpio takes just as long, plus or minus a few seconds. Accordingly, retrieving a file from an archive (via tar -xf file.tar myFileOfInterest, for example) is just as slow.

Is there an archival method out there that keeps a readily available "catalog" with the archive, so that an individual file within the archive can be retrieved quickly? For example, some kind of catalog that stores a pointer to a particular byte in the archive, as well as the size of the file to be retrieved (and any other filesystem-specific particulars)?

I guess my question is, why do you need to tar this data at all?
File transfers are only negligibly more difficult with directories (scp vs scp -r). You've already bzipped the data, so disk space isn't the issue. A good directory structure will allow you to find your data and access it easily, without any of the overhead of tar.
I'm interested from an academic standpoint to know if there is an answer, but unsure why you need it, I suppose.
posted by chrisamiller at 5:14 PM on August 27, 2009
Correction to the above, per the dar online manual:
"dar can use compression. By default no compression is used."

So you don't need to do anything to exclude your bz2 files unless you explicitly turn on compression. I thought for some reason compression defaulted to on, but I was wrong; I think that's because I always use it that way in my backup scripts. Anyway:

"[E]ven using compression dar does not have to read the whole backup to extract one file. This way, if you just want to restore one file from a huge backup, the process will be much faster than using tar. Dar first reads the catalogue (i.e. the contents of the backup), then goes directly to the location of the saved file(s) you want to restore and proceeds with restoration. [...]" (Emphasis mine.)
So it sounds like it will do exactly what you want.
posted by Kadin2048 at 5:14 PM on August 27, 2009 [1 favorite]
Have you tried ZIP? The format has a directory, so it should do what you want, unlike the concatenated headers+files of tar and cpio.
posted by grouse at 5:18 PM on August 27, 2009
Response by poster: I'm interested from an academic standpoint to know if there is an answer, but unsure why you need it, I suppose.
Mainly for scalability and consistency. We're grabbing datasets (e.g. scores) from tables on UCSC. Data are separated by units of chromosomes. Presently we tar the compressed versions of these files.
Using tar allows us to store whole tables in one file, which gives all of us one point of reference for the archived data. We can retrieve and work with one chromosome's worth of data as needed, without touching the rest.
posted by Blazecock Pileon at 5:28 PM on August 27, 2009
Pretty sure both rar and zip do this (i.e., keep an index of the archive's contents).
posted by Rhomboid at 6:04 PM on August 27, 2009
A quick test, using the dar source tarball:
First, a single-file restore using tar:
    $ time tar -xvf dar.tar dar-2.3.9/NEWS
    dar-2.3.9/NEWS

    real    0m0.101s
    user    0m0.014s
    sys     0m0.080s

Compared to dar:

    $ time dar -O -x dar -g dar-2.3.9/NEWS
    [output removed]

    real    0m0.122s
    user    0m0.075s
    sys     0m0.037s

Then I tried it using a single file from a 315 MB dataset consisting mostly of small files; again, here's tar:

    real    0m0.481s
    user    0m0.056s
    sys     0m0.411s

And then dar:

    real    0m0.101s
    user    0m0.058s
    sys     0m0.036s

So as you increase the size of the archive, the difference between tar and dar becomes more and more apparent. I suspect if you tried it with some multi-gigabyte files, you'd see a serious time savings.
Dar does have some quirks that make it not quite a drop-in replacement for tar; I find myself having to consult the manual every time I go to use it, but it has enough options that I've never really run into a situation where it couldn't do something that I wanted it to. (Its real advantage versus tar is in spanning archives across multiple disks/files, like if you were burning several GB of data to CD-R. If you do this, it's smart enough to only ask you for the 'slice' that contains the file you want to recover.)
posted by Kadin2048 at 6:29 PM on August 27, 2009
Response by poster: I'd like to test dar. Is it possible to extract a file to standard output? It seems possible to pipe the archive to standard output, but I can't see what options to use to extract a file to stdout.

BTW, I also tested 7z, which looks very promising.
posted by Blazecock Pileon at 6:37 PM on August 27, 2009
You can create a filesystem in a file, mount it via the loopback device, and use that as an archive. The process is the same as making an image to burn to CD, only you can dispense with burning to different media if you want. This is how most people package things for MacOS these days: as disk images that, when "opened," appear as a file hierarchy under /Volumes. I guess under Linux you would use some flavor of mkfs.
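A rough sketch of that image-file approach under Linux. The size, paths, and the choice of ext2 are all illustrative, not prescribed; the mount steps need root, so they are shown commented out:

```shell
# Make a small image file and put a filesystem in it (size is illustrative).
dd if=/dev/zero of=archive.img bs=1M count=16 status=none
mkfs.ext2 -q -F archive.img

# Mounting needs root; once mounted, single-file access is an ordinary
# filesystem read, with no linear scan of an archive.
# sudo mkdir -p /mnt/archive
# sudo mount -o loop archive.img /mnt/archive
# sudo cp tables/*.bz2 /mnt/archive/
# sudo umount /mnt/archive
```

The trade-off versus an archive format is that the image has a fixed size chosen up front.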
posted by fantabulous timewaster at 8:01 PM on August 27, 2009
I'm going to suggest that you consider using git and maybe ZFS. I know that's not quite what you asked, but I think it makes sense for large datasets. Here's why: git gives you a full history. Unless everyone is very disciplined about naming their archive files and making certain that every necessary bit is included, your archive files are going to end up a mess. If (and I know it's a big if) you have large data files whose contents change incrementally, git will attempt delta compression between versions, which saves disk space. I'm assuming, because disk space is cheap nowadays, that you're going to keep all the data on disk and online and are just looking for a way to organize it. If that's the case, putting everything in a git repository makes distribution much easier.
I've used git for source control, not for storing huge datasets so there might be hidden traps.
As soon as ZFS gets data deduplication, I think it and its cousins, together with the fact that disks are about as cheap as traditional backup media, are going to change the way we think about backups. ZFS already has snapshots, copy-on-write, compression, and the ability to just keep adding disks.
I've been thinking about this a lot for the past week so I may be answering my own questions rather the question you asked but I thought I'd throw this out there.
posted by rdr at 8:16 PM on August 27, 2009
I'd also consider compressing each chromosome (or unit of chromosomes) separately.
posted by rhizome at 9:11 PM on August 27, 2009
Blazecock Pileon: "I'd like to test dar. Is it possible to extract a file to standard output? It seems possible to pipe the archive to standard output, but I can't see what options to use to extract a file to stdout."

Hmm... good question. I've never used it that way. It doesn't look like it does, at least from my reading of the manual. But it has some fairly complex features that I've never touched, involving dar_slave and client/server mode, that might be able to do it in some less-than-obvious way. In general, though, it looks pretty much built for file-oriented rather than stream-oriented operation.
posted by Kadin2048 at 9:38 PM on August 27, 2009
Response by poster: I'd also consider compressing each chromosome (or unit of chromosomes) separately.
To reiterate, the archive is made up of bz2 (bzip2) files.
posted by Blazecock Pileon at 9:58 PM on August 27, 2009
Response by poster: But in general it looks pretty much built for file-oriented operation rather than stream-oriented.
If that's the case, that's not going to be as helpful as other tools (we like to pipe stuff between commands as much as possible). But it might be useful for other work, so thanks for pointing it out!
posted by Blazecock Pileon at 10:00 PM on August 27, 2009
Two people have said this already, but I'll concur: try zip -0 and unzip. zip -0 (that's a zero) turns off compression, avoiding wasted time given that your contents are already well compressed. You can unzip a single file to stdout with unzip -p file.zip filetoextract. Zip keeps a catalog at the end of the file, so getting files out of large archives should be efficient, hopefully only a few seeks. That's not to say there aren't implementation bugs.
posted by sergent at 10:32 PM on August 27, 2009
rdiff-backup?
Really it sounds like you should just be using a directory structure and some tools to enforce it - rather than relying on tar to archive things.
If you are always pulling individual items out of the archives - they probably shouldn't be tarred up in the first place.
Version control (svn, git, etc) or just a plain directory structure with some supporting scripts to index, checksum, and so on would probably make more sense.
Or a database.
posted by TravellingDen at 11:04 PM on August 27, 2009
Response by poster: We already have a naming and organizational scheme for these files (which is tied closely to how our lab data browser operates) as well as packaging/unpackaging tools which are used to work with these bundles.
Adding version control and changing filesystems or naming schemes wouldn't solve the root of this specific problem and would likely introduce several new and larger headaches. I'll be honest and say that these three approaches are probably non-starters.
A database is great for random access and we use this for visualizing data, but for storage and performance reasons, lossy compression is used for some of the data put into the database. To get to the true data values we need to handle packaging of files that other institutions have available and we need to be able to use reasonably standard and/or open-source UNIX tools and procedures to do this, which motivates my question. (Additionally, filesystem access lets us reduce load on our already overburdened database.)
It sounds like an index-capable archival tool like 7z or zip may help solve this issue. Thanks to all for your advice!
posted by Blazecock Pileon at 12:12 AM on August 28, 2009
When I dealt with this sort of thing, I built a filter that split the input and independently bzipped each piece, producing files like largearchive.tar.aaa.bz2. Block headers let tar pick up in the middle. I split the files at CD size (700 MB) and also had the filter tee into a "tar tvf" to keep a catalog. Restoring was a matter of piping the relevant files through bzip2 and then tar.
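A minimal sketch of this split-plus-catalog scheme, using tiny stand-in files instead of multi-GB data (names and sizes are illustrative):

```shell
# Stand-in for the real data.
mkdir -p tables
printf 'chr1 scores\n' > tables/chr1.dat

# Split the tar stream into fixed-size pieces (700 MB in the original scheme).
tar cf - tables | split -b 512k - archive.tar.

# Keep a catalog alongside the pieces before compressing them.
cat archive.tar.* | tar tvf - > catalog.txt
for piece in archive.tar.*; do bzip2 "$piece"; done

# Restore one file: decompress the pieces back into a stream and extract.
bzcat archive.tar.*.bz2 | tar -xOf - tables/chr1.dat > restored.dat
```

The catalog tells you which piece holds a given file, so for a restore you only need to decompress from that piece onward rather than the whole set.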
posted by Pronoiac at 1:29 AM on August 28, 2009
The root of the problem is that the tar format does not maintain an index of the files in the archive; extracting "myFileOfInterest" means seeking through the archive one file header at a time looking for the right file name. zip and 7zip archives have indexes and will work much better. (FWIW, if you ditch the per-file bz2 and store the uncompressed files in a 7zip archive, there's a good chance you'll see massive space savings; 7zip really shines when redundancy in the data spans multiple files in the archive, which I suspect is the case for you. And you'd get the indexing win.)
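Short of switching formats, the byte-pointer catalog the poster describes can be approximated with GNU tar itself: pay the slow linear scan once, using -R to record each member's block offset, then extract later with a single seek via dd. A sketch with made-up stand-in data:

```shell
# Build a sample archive.
mkdir -p tables
printf 'chr1 scores\n' > tables/chr1.dat
tar -cf archive.tar tables

# One slow pass records each member's 512-byte block offset (GNU tar -R).
tar -tvRf archive.tar > catalog.txt

# A catalog line looks like:
#   block 1: -rw-r--r-- user/group 12 2009-08-27 17:00 tables/chr1.dat
block=$(awk '/chr1\.dat$/ {sub(":", "", $2); print $2}' catalog.txt)
size=$(awk '/chr1\.dat$/ {print $5}' catalog.txt)

# Later: one seek instead of a scan; data starts one block past the header.
dd if=archive.tar bs=512 skip=$((block + 1)) count=$(( (size + 511) / 512 )) 2>/dev/null \
  | head -c "$size" > chr1.dat
```

This leans on tar's fixed 512-byte block layout, so it only works against the uncompressed .tar, not a .tar.bz2 stream.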
Using tar allows us to store whole tables in one file, which gives all of us one point of reference for the archived data. We can retrieve and work with one chromosome's worth of data as needed, without touching the rest.
How does one directory per table not provide this functionality? Why the need to wrap all the files into one big file? I'll assume you've got your reasons, but it's not clear what they are.
posted by buxtonbluecat at 5:40 AM on August 28, 2009
The Internet Archive has a use case similar to yours. They developed a format that they call ARC (see: ARC file format), which explicitly separates the index of the files from the archive itself.
(Note that there are two archive formats with the extension .arc)
posted by bdc34 at 8:08 AM on August 28, 2009
I don't know for a fact that it's any faster than tar at retrieving files, but the archive format is different and is aimed at modern storage devices rather than linear tape, so it wouldn't surprise me if it were. It normally compresses files by default (unlike tar), but it has a command-line switch to exclude certain file extensions like .gz or .bz2.
If I get a chance tonight I'll run some tests and see if dar is faster to retrieve files from an archive.
posted by Kadin2048 at 5:05 PM on August 27, 2009