Live or recorded data feeds
June 12, 2005 8:20 PM   Subscribe

I'm looking for datasets (or ideas for possible datasets), preferably live feeds of data, with which to stress test some analysis algorithms I've written.

The dataset needs to:

* Be either live (as in a source I can harvest data from as it's created) or pre-recorded.
* Cover a stream of changes over time about a large number of discrete objects.
* Be accessible in as raw a form as possible, preferably XML -- scraping inconsistently-tagged HTML files is a recipe for wasted time.
* Free or fairly cheap -- a few hundred $ is okay, thousands of $ is not.

The algorithms use the stream of changes as input, and produce various statistics as the output. Each conceptual change needs to relate to an object's creation, modification or eventual deletion. An "object" may be a thermometer, or a sensor, or a person, or a car -- anything that can exist.

Each object must be expressible as a set of metadata covering as many of the following classifications as possible:

* Text: Free text, such as titles and descriptions.
* Hierarchical: For example, a "location" may be broken into a hierarchy of continents, countries, states, counties, cities, etc.
* Geographical: One or more geographical locations, expressed in either longitude/latitude or nautical coordinates.
* Scalar: For example, "size", "age", "weight", "length", "temperature" are scalar values.
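Concretely, one event in the stream might look something like this -- a rough sketch, where the field names and the example values are purely illustrative, not a real schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# A hypothetical change event: each event creates, modifies, or deletes
# one object, and carries whatever metadata the object exposes.
@dataclass
class ChangeEvent:
    object_id: str
    kind: str                      # "create", "modify", or "delete"
    timestamp: float               # seconds since epoch
    text: Optional[str] = None     # free text: title, description, etc.
    hierarchy: tuple = ()          # e.g. ("Europe", "Norway", "Oslo")
    lat: Optional[float] = None    # geographical component
    lon: Optional[float] = None
    scalars: dict = field(default_factory=dict)  # e.g. {"temperature": 4.2}

event = ChangeEvent(
    object_id="buoy-117",
    kind="modify",
    timestamp=1118600000.0,
    hierarchy=("Europe", "North Sea"),
    lat=58.5, lon=3.2,
    scalars={"temperature": 9.8},
)
print(event.kind, event.scalars["temperature"])
```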

Having a geographical component is relatively important, as the algorithms are geographical by nature; but if need be, I can derive or randomize geographical locations from non-geographical data.
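By "derive or randomize" I mean something as dumb as hashing whatever stable identifier the object has into a deterministic pseudo-random point -- a sketch of that fallback:

```python
import hashlib

# Map any stable identifier to a deterministic pseudo-random lat/lon.
# The same id always lands on the same point, so repeated events for one
# object stay geographically consistent.
def fake_location(object_id):
    digest = hashlib.sha1(object_id.encode("utf-8")).digest()
    lat = int.from_bytes(digest[:4], "big") / 2**32 * 180 - 90
    lon = int.from_bytes(digest[4:8], "big") / 2**32 * 360 - 180
    return lat, lon

lat, lon = fake_location("buoy-117")
print(round(lat, 2), round(lon, 2))
```

Obviously the resulting distribution is uniform over the globe rather than realistic, which is part of why I'd prefer real data.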

There are many geographically-related datasets out there -- air temperatures, ocean buoys, population sizes, and so on. However, the ones I have found are always small and unchanging, and the amount of metadata is limited. For example, a bit of rainfall statistics from the 1800s will not suffice. Neither will a static database of US city populations.

To give you an idea of what I'm looking for: at one point I worked on a project where we had access to the live stream of data coming from an oil field in the North Sea -- a large, continuous mass of heterogeneous data covering everything from drilling points to oil and gas extraction statistics, with lots of deep and rapidly changing scientific information. Millions of changes. That's the kind of dataset I want. Unfortunately, I no longer have access to this feed.

I'm toying with different possibilities. One idea is to monitor an IRC network, treating each user as an "object" and recording their names, login/logout times, and so on, and artificially mapping their IP addresses to geographical locations. Another is to collect a large number of RSS feeds and track those, although again the geographical component becomes artificial -- and RSS feeds deal mostly with new articles, not changing ones. Yet another idea is to track stock markets, but I don't know where I can get raw stock feeds.
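The IRC idea boils down to turning raw server lines into create/delete events per user. A minimal sketch of that parsing step (the nicks and channel below are made up, and a real client would also need to handle PART, NICK, registration, etc.):

```python
import re

# Turn raw IRC protocol lines into change events: JOIN creates an
# "object" (the user), QUIT deletes it. Anything else is ignored here.
JOIN_RE = re.compile(r"^:(?P<nick>[^!]+)![^ ]+ JOIN :?(?P<chan>#\S+)")
QUIT_RE = re.compile(r"^:(?P<nick>[^!]+)![^ ]+ QUIT")

def irc_line_to_event(line):
    m = JOIN_RE.match(line)
    if m:
        return {"object": m.group("nick"), "kind": "create",
                "channel": m.group("chan")}
    m = QUIT_RE.match(line)
    if m:
        return {"object": m.group("nick"), "kind": "delete"}
    return None

print(irc_line_to_event(":alice!a@host.example JOIN :#datasets"))
```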
posted by gentle to Computers & Internet (6 answers total) 1 user marked this as a favorite
You don't mention it in your description, but have you tried generating a large body of synthetic data and using that to test your algorithm?

Or, using the sea/climatic data, you could interpolate and resample to get data points in between the sample points to fill out your data set.
posted by scalespace at 8:30 PM on June 12, 2005

WebTrends has about 60 megs of sample web stats. That might be useful.
posted by devilsbrigade at 8:51 PM on June 12, 2005

Any way you could convert an online radio station to usable data?
posted by null terminated at 9:34 PM on June 12, 2005

Response by poster: scalespace, I use synthetic data for some tests. But for this particular test, the idea is to simulate real-world behaviour as accurately as possible without actually exposing the code to the real world. Even the most well-designed synthetic data will not provide the true randomness and coverage that I'm looking for -- the data combinations that will trip up this code will be things I didn't consider beforehand.

I don't see what an online radio station can give me. I'm not looking for a continuous stream of raw bytes; if I were, I could scrape anything, including my own hard drive.
posted by gentle at 7:30 AM on June 13, 2005

The NASDAQ apparently has some sort of web service whereby you can grab 10 quotes at a time, on a 15-minute delay. Maybe just cycle through sets of 10 quotes repeatedly, treating each ticker symbol as an object, and the price as a property?
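Roughly, the loop would batch symbols into tens and diff each poll against the last one, emitting a "modify" event whenever a price moves (the prices below are invented, and the actual fetch is whatever the web service provides):

```python
import itertools

# Cycle symbols in batches of 10, then diff successive polls: a new
# symbol is a "create" event, a changed price is a "modify" event.
def batches(symbols, size=10):
    it = iter(symbols)
    while batch := list(itertools.islice(it, size)):
        yield batch

def diff_quotes(old, new):
    events = []
    for symbol, price in new.items():
        if symbol not in old:
            events.append(("create", symbol, price))
        elif old[symbol] != price:
            events.append(("modify", symbol, price))
    return events

old = {"AAPL": 36.1, "MSFT": 25.4}
new = {"AAPL": 36.3, "MSFT": 25.4, "INTC": 26.0}
print(diff_quotes(old, new))
```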

Also, maybe you could use weather data, treating each ZIP code as an object with historical, current, and forecast low/high temperature data? I would recommend the NOAA's 108-year historical data.

Just some thoughts, as I'm not sure if I fully grok your request. Good luck though!
posted by chota at 8:52 AM on June 13, 2005

Response by poster: Good ideas, chota, though cycling through quotes on a 15-minute delay would take forever to generate any useful volume of data for stress testing. The weather data might be useful, though it looks rather small.

What do you not grok about my request? I would be happy to amplify.
posted by gentle at 9:03 AM on June 13, 2005
