Publicly available data sets?
April 29, 2005 2:41 PM

I've undertaken some data compression research and have been looking for publicly available data sets of large integers without much success. My research is in the same general area as this paper, and I have contacted the authors but unfortunately their data is lost in the mists of time. Can anyone recommend any publicly available data sets of large integers?
posted by harmfulray to Computers & Internet (13 answers total)
posted by skwm at 2:50 PM on April 29, 2005

How about the digits of pi? Or did you mean sets of large integers, not large sets of integers?
posted by driveler at 2:51 PM on April 29, 2005

Why not just download a sufficiently random Random Number Generator API and generate your own? There are plenty available for C++, Java, etc.
posted by skwm at 2:52 PM on April 29, 2005

Thanks for your suggestions, but the problem with random numbers is that they are only compressible to the extent that they are not random. In other words, there's no redundant information that can be removed if the data are truly random. So they are not good candidates for evaluating the performance of compression algorithms. I'm hoping to find data sets that exhibit a reasonable amount of order so there will be a significant degree of compression across several algorithms.
posted by harmfulray at 3:02 PM on April 29, 2005
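
The point above about random data can be checked with a quick experiment. This is only a minimal sketch: zlib stands in for whatever compression algorithm is actually under study, and the "ordered" data set is a made-up example of slowly increasing 32-bit integers, not anything from the thread.

```python
import os
import struct
import zlib

def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)

# 100,000 bytes from the OS entropy source: effectively no redundancy.
random_bytes = os.urandom(100_000)

# 25,000 slowly increasing 32-bit integers (hypothetical ordered data):
# each value repeats ten times, so there is a lot of redundancy.
ordered_bytes = b"".join(struct.pack("<I", i // 10) for i in range(25_000))

random_ratio = compression_ratio(random_bytes)
ordered_ratio = compression_ratio(ordered_bytes)

print(f"random:  {random_ratio:.3f}")
print(f"ordered: {ordered_ratio:.3f}")
```

The random input should come out at roughly its original size, while the ordered integers should shrink substantially, which is why ordered data makes the better benchmark.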

Why not download large datasets from the Census Bureau, or the National Election Studies? Etexts of the Bible? Just call each ASCII character an integer.
posted by ROU_Xenophobe at 3:06 PM on April 29, 2005

astronomy images? raw data are typically (i think) arrays of 4 byte values. i don't know whether something like that, which has a pretty trivial conversion to "large integer", is sufficiently close, or whether you really need individual values. if the latter, what kind of size?
posted by andrew cooke at 3:06 PM on April 29, 2005

If you're at the point where you have to run your method on an experimental data set (and you haven't done this already), I'd suggest that you first run it on artificial data sets with known parameters.

For instance, it's not too hard to make a memory-driven process (for instance, a random walk) that will have known characteristics in terms of information content, scaling factors and spectra (brown, pink, etc.). If I were reading your paper, I'd want to see the relationship between the known statistical qualities of a data set and its compressibility. Quantities of 'memory' such as fractal dimension are good for this.

But... that's not what you were asking. Sorry.

For what you were asking, I'd recommend an astronomical data set (which is a set of Poisson-distributed count variables); they're also a good example of intractably large datasets. Usually, these are available online somewhere (although possibly already compressed into int32 or somesuch).

Or, you could cobble together some weather data from here:

Or here is ~40mb (compressed) of coastal weather data, with a bunch of integers that you could grep out:

Hope that helps a bit.
posted by metaculpa at 3:28 PM on April 29, 2005
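
metaculpa's suggestion of an artificial data set with known parameters could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the step set {-1, 0, 1}, the weights, the seed, and the use of zlib as the compressor. The idea is that the Shannon entropy of the step distribution gives a known reference value (in bits per sample) to compare the measured compressed size against.

```python
import math
import random
import zlib

def random_walk(n, steps, weights, seed=0):
    """Generate n positions of a random walk with i.i.d. steps."""
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    position = 0
    walk = []
    for step in rng.choices(steps, weights=weights, k=n):
        position += step
        walk.append(position)
    return walk

def entropy_bits(weights):
    """Shannon entropy, in bits per sample, of a discrete distribution."""
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

# Hypothetical parameters: the walk mostly stays put, occasionally moves.
steps = [-1, 0, 1]
weights = [0.1, 0.8, 0.1]
walk = random_walk(50_000, steps, weights)

# Delta-encode the walk: the differences recover the independent steps.
deltas = [b - a for a, b in zip([0] + walk[:-1], walk)]
packed = bytes(d + 1 for d in deltas)  # map {-1, 0, 1} to bytes {0, 1, 2}

reference_bits = entropy_bits(weights)
zlib_bits = 8 * len(zlib.compress(packed, 9)) / len(deltas)

print(f"entropy of step distribution: {reference_bits:.3f} bits/sample")
print(f"zlib on delta-coded walk:     {zlib_bits:.3f} bits/sample")
```

Varying the weights changes the known entropy, so the same script can show how far a given compressor sits from the reference value as the data become more or less ordered.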

probably being pedantic or misunderstanding here, but "poisson distributed count variables" seems to imply a single underlying rate across the image and ignores other sources of noise than photon counts. in practice, raw optical astronomy images have structure (due to limitations in optics and electronics, and the fact that you look at "interesting" things) and additional noise sources (typically from "read out noise", usually considered as gaussian). also, of course, images on web sites like "astronomy picture of the day" are processed, not raw, and so are going to be completely different - for raw images you need to go to a "real" astronomy source.
posted by andrew cooke at 3:41 PM on April 29, 2005


of course, you may have to do a little typing.
posted by felix at 3:58 PM on April 29, 2005

Why don't you take one of the standard text corpuses, such as those listed here, replace each unique word with its own integer, pack the result, and go from there? You'll have statistically meaningful results and an ability to compare against competing algorithms.
posted by effugas at 4:28 PM on April 29, 2005

Well I'd suggest genome sequences. But then again, I would.
posted by grouse at 5:37 PM on April 29, 2005

Thanks, everyone, for your suggestions! I am going to follow effugas' suggestion, because the algorithms I'll be using have the property that compression is best for lower integers. If I map words to integers such that higher frequency words map to lower integers, I should get good compression and a good basis for comparison with other algorithms. Again, many thanks to all who responded.
posted by harmfulray at 7:45 PM on April 29, 2005
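
The word-to-integer mapping harmfulray describes might look something like this. It is a rough sketch under assumptions: words are split on whitespace and lowercased, ties in frequency are broken alphabetically, and the function name `words_to_ranked_ints` and the sample sentence are invented for the example.

```python
from collections import Counter

def words_to_ranked_ints(text):
    """Map each word to its frequency rank (0 = most frequent word)."""
    words = text.lower().split()
    counts = Counter(words)
    # Sort by descending frequency; break ties alphabetically so the
    # mapping is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    word_to_int = {word: rank for rank, word in enumerate(ranked)}
    return [word_to_int[w] for w in words], word_to_int

sample = "the cat and the dog and the bird"
encoded, mapping = words_to_ranked_ints(sample)

print(encoded)  # [0, 3, 1, 0, 4, 1, 0, 2]
print(mapping)
```

Because the most common words get the smallest integers, the encoded stream is dominated by low values, which suits algorithms that compress lower integers best.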


Let me know what you build, will ya? Especially if it has any interesting graphable implications.


Got any massive graphs lying around?
posted by effugas at 8:24 PM on April 29, 2005
