Data compression question
February 2, 2006 3:34 PM

Is it possible to compress a large file while it is being written to disk? My IT shop is developing a program to translate a huge binary file of data into a field-delimited text file suitable for pulling into standard data analysis tools. I know applications can read and manipulate data from a compressed file without explicitly decompressing it. But can translation and compression happen simultaneously, or must the whole file be written out and then compressed? This makes a big difference in the data storage requirements. Thanks
posted by queue_strategy to Computers & Internet (8 answers total)
 
First off, if we're going to do anything more than theory here, we're going to need some more detail -- OS and version, for starters.

For my webhost, when I want to back up, I use...

tar -cvzf backup.tar.gz *

which creates a tar archive of all my files and subdirectories and compresses it with gzip, all in the same wonderful process.

I don't know a lot about *nix, but it seems you can plug gzip into a lot of on-the-fly processes like this. I've no idea exactly how to do it in your case, though.
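The reverse works on the fly too. A rough sketch -- the file names and the "|" delimiter here are just placeholders for whatever you actually have:

gunzip -c backup.tar.gz | tar -tvf -                # list the archive without unpacking it
gunzip -c output.txt.gz | awk -F'|' '{ print $1 }'  # read a gzipped delimited file straight into awk

Nothing uncompressed ever has to land on disk; the data is expanded in the pipe as the downstream tool asks for it.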
posted by tiamat at 3:46 PM on February 2, 2006


Yes, it's certainly possible to compress a file while writing it. gzip and bzip2, for instance, both work this way: they process the data in blocks, compressing each block as it comes through rather than waiting for the whole file.
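You can watch it happen from the shell. A quick demonstration, assuming your head supports -c and you have /dev/urandom (most Linux and BSD boxes do); the sizes and sleeps are arbitrary:

( for i in 1 2 3 4 5; do head -c 1000000 /dev/urandom; sleep 1; done ) | gzip -c > growing.gz &
while sleep 1; do ls -l growing.gz; done            # interrupt with Ctrl-C when you've seen enough

growing.gz gets bigger every second while the producer is still running, so gzip clearly isn't waiting for end-of-input before it writes compressed output.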
posted by pocams at 3:49 PM on February 2, 2006


Sure, dynamic or streaming compression of files has been around a long time. MS-DOS did it for entire disk drives with its DriveSpace and DoubleSpace utilities back in the early '90s. Try a compression library with an API that supports block-level compression for reading and writing files. I used to use one several years ago -- I think it was zlib, but zlib's home site is down now.

Since you control the data writes, you should be in good shape for using a library which supports dynamic compression. SCZ, for example, supports most common operating systems and is open-source.
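If linking a compression library into the translator is more work than you want, a named pipe gets you much the same dynamic-compression effect from the shell. A rough sketch -- "yourtranslator", and the idea that it takes an output path as its argument, are made-up stand-ins for your program:

mkfifo /tmp/translated.fifo
gzip -c < /tmp/translated.fifo > output.txt.gz &    # reader: compresses whatever comes through the pipe
yourtranslator /tmp/translated.fifo                 # writer: thinks it's writing an ordinary file
wait
rm /tmp/translated.fifo

Every byte the translator writes goes straight through gzip, and only the compressed version ever touches the disk.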
posted by mdevore at 4:07 PM on February 2, 2006


When you say a "huge binary file" -- how huge? Do you need to execute the file, or just read and compress on the fly? The former could increase the complexity of the program you're developing by a few orders of magnitude. If it's just the latter... mdevore's answer is a good way to go.
posted by Civil_Disobedient at 4:45 PM on February 2, 2006


Best answer: There are real differences in how compression tools handle streaming (i.e. where you cannot seek forward and back in the input and only ever see the next sequential block). gzip handles this very well: deflate works over a small sliding window and can emit compressed output almost as soon as data comes in, which is one reason many network protocols use gzip (deflate) compression. bzip2 will also read from a pipe, but it compresses in much larger blocks (100-900 KB) and is considerably slower and hungrier for memory, so output arrives in bigger, later bursts. On the other hand, bzip2 usually achieves noticeably better compression ratios, so the choice comes down to compression efficiency versus speed and immediacy.
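If you want numbers for your own data before committing, it's easy to test both on a representative sample. A sketch, with "sample.txt" standing in for a decent-sized chunk of your translated output:

time gzip -c sample.txt > sample.txt.gz
time bzip2 -c sample.txt > sample.txt.bz2
ls -l sample.txt sample.txt.gz sample.txt.bz2       # compare the sizes

That puts the speed and the resulting file sizes side by side, which is the whole trade-off in a nutshell.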
posted by Rhomboid at 5:00 PM on February 2, 2006


Yes. In particular, under Unix, using gzip...

yourprogram < yourinputfile 2> errors.log | gzip -cn > outputfile.gz

...presuming yourprogram reads from STDIN and writes to STDOUT and STDERR. This feeds yourinputfile into yourprogram, sends anything written to STDERR to "errors.log", and pipes STDOUT into gzip, which compresses it and writes the result to "outputfile.gz".

Technically you don't need the "-cn" -- "-c" means "write to STDOUT" and "-n" means "don't store name information" -- because gzip already behaves that way when you're piping in, but being explicit is safer; one never knows where the bugs are in various implementations. In my case (FreeBSD 6.0-RELEASE), it worked correctly with and without the switches.

Finally, in some very specific cases (data with very high amounts of redundancy), compress will generate a smaller output than gzip. Same command line, but use "compress -c" instead of "gzip -cn", and change the final extension to .Z instead of .gz.
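Spelled out, that variant of the same pipeline would be:

yourprogram < yourinputfile 2> errors.log | compress -c > outputfile.Z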
posted by eriko at 6:29 PM on February 2, 2006


Why don't you convert it into the new format in small chunks and then compress those chunks? Does everything have to end up in one file or can you make smaller files and import them as a group?
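One way to do that from the shell, assuming the translator writes to STDOUT and that pieces of roughly 1 GB are acceptable (both assumptions -- adjust to taste):

yourprogram < yourinputfile | split -b 1000m - part_   # cut the text output into ~1 GB pieces
gzip part_*                                            # then compress each piece

The catch is that the pieces sit on disk uncompressed until gzip gets to them, so if storage is the real constraint, piping straight into gzip as above is still the safer bet; splitting mainly buys you smaller files you can load independently.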
posted by voidcontext at 6:36 PM on February 2, 2006


Response by poster: Thanks for the comments. The OS is a flavor of UNIX, the uncompressed sizes are 50-100 gigs and there will be multiple files. The comments about gzip vs bzip2 are especially helpful -- we may end up trading some compression efficiency for immediate compression.
posted by queue_strategy at 6:05 AM on February 3, 2006


This thread is closed to new comments.