I'm looking for datasets (or ideas for possible datasets), preferably live feeds of data, with which to stress test some analysis algorithms I've written.
The dataset needs to:
* Be either live (as in a source I can harvest data from as it's created) or pre-recorded.
* Cover a stream of changes over time about a large number of discrete objects.
* Be accessible in as raw a form as possible, preferably XML -- scraping inconsistently tagged HTML files is a waste of time.
* Free or fairly cheap -- a few hundred $ is okay, thousands of $ is not.
The algorithms use the stream of changes as input, and produce various statistics as the output. Each conceptual change needs to relate to an object's creation, modification or eventual deletion. An "object" may be a thermometer, or a sensor, or a person, or a car -- anything that can exist.
Each object must be expressible as a set of metadata covering as many of the following classifications as possible:
* Text: Free text, such as titles and descriptions.
* Hierarchical: For example, a "location" may be broken into a hierarchy of continents, countries, states, counties, cities, etc.
* Geographical: One or more geographical locations, expressed in either longitude/latitude coordinates or nautical coordinates.
* Scalar: For example, "size", "age", "weight", "length", "temperature" are scalar values.
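To make the shape of the input concrete, here is a rough sketch of what one change record could look like. The names (`Change`, `ChangeKind`, `apply_change`) and the field layout are purely illustrative assumptions, not the format my algorithms actually require.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, List, Tuple


class ChangeKind(Enum):
    CREATE = "create"
    MODIFY = "modify"
    DELETE = "delete"


@dataclass
class Change:
    """One event in the stream (hypothetical layout)."""
    object_id: str
    kind: ChangeKind
    timestamp: datetime
    # Text: free text such as titles and descriptions.
    text: Dict[str, str] = field(default_factory=dict)
    # Hierarchical: e.g. {"location": ["Europe", "Norway", "Oslo"]}.
    hierarchy: Dict[str, List[str]] = field(default_factory=dict)
    # Geographical: zero or more (latitude, longitude) pairs.
    locations: List[Tuple[float, float]] = field(default_factory=list)
    # Scalar: e.g. {"temperature": 4.5}.
    scalars: Dict[str, float] = field(default_factory=dict)


def apply_change(state: Dict[str, Change], change: Change) -> None:
    """Fold one change into a dict holding the latest record per live object."""
    if change.kind is ChangeKind.DELETE:
        state.pop(change.object_id, None)
    else:
        # CREATE and MODIFY both overwrite the stored record.
        state[change.object_id] = change
```

Any feed that can be mapped onto records like these (an ID, a create/modify/delete kind, a timestamp, and some mix of the four metadata types) would be usable.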
Having some geographical component is relatively important, as the algorithms are geographical by nature; but if need be I can derive or randomize geographical locations from non-geographical data.
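As an illustration of the fallback, one way to invent a stable location for a non-geographical object is to hash some key of the object into a latitude/longitude pair. This is only a sketch; the function name `pseudo_location` and the choice of SHA-256 are my own assumptions, and the resulting points are uniform in latitude and longitude, not uniform over the surface of the globe.

```python
import hashlib
from typing import Tuple


def pseudo_location(key: str) -> Tuple[float, float]:
    """Map an arbitrary key to a repeatable (latitude, longitude) pair."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Turn two 8-byte slices of the digest into fractions between 0 and 1.
    a = int.from_bytes(digest[:8], "big") / 2**64
    b = int.from_bytes(digest[8:16], "big") / 2**64
    lat = a * 180.0 - 90.0    # within [-90, 90]
    lon = b * 360.0 - 180.0   # within [-180, 180]
    return lat, lon
```

The same key always yields the same point, so an object keeps its artificial location across all of its changes.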
There are many geographically-related datasets out there -- air temperatures, ocean buoys, population sizes, and so on. However, the ones I have found are always small and unchanging, and the amount of metadata is limited. For example, a bit of rainfall statistics from the 1800s will not suffice. Neither will a static database of US city populations.
To give you an idea of what I'm looking for: at one point I worked on a project where we had access to the live stream of data coming from an oil field in the North Sea. It was a large, continuous mass of heterogeneous data covering anything from drilling points to oil and gas extraction statistics, with lots of deep and rapidly changing scientific information. Millions of changes. That's the kind of dataset I want. Unfortunately, I no longer have access to this feed.
I'm toying with different possibilities. For example, one idea I have is to monitor an IRC network, treating each user as an "object" and recording their names, login/logout times, and so on, and artificially mapping their IP address to a geographical location. Another is to collect a huge number of RSS feeds and track them, although again the geographical component becomes artificial, and RSS feeds deal mostly with new articles, not changing ones. Yet another idea is to track stock markets, but I don't know where I can get raw stock feeds.
posted by gentle to computers & internet (6 comments total)
Or, using the sea/climatic data, you could interpolate and resample to get data points in between the sample points to fill out your data set.
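Roughly what I mean, as a minimal sketch: plain linear interpolation onto a regular time grid. The function name `resample_linear` is made up for illustration, and it assumes the sample times are strictly increasing. Keep in mind the in-between points are synthetic, so they add volume rather than new information.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def resample_linear(times: Sequence[float],
                    values: Sequence[float],
                    step: float) -> List[Tuple[float, float]]:
    """Resample (times, values) every `step` units by linear interpolation."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two samples of equal length")
    if step <= 0:
        raise ValueError("step must be positive")
    out = []
    n = 0
    while True:
        t = times[0] + n * step
        if t > times[-1]:
            break
        # Index of the sample at or just before t, clamped to the last segment.
        i = min(bisect_right(times, t) - 1, len(times) - 2)
        t0, t1 = times[i], times[i + 1]
        frac = (t - t0) / (t1 - t0)
        out.append((t, values[i] + frac * (values[i + 1] - values[i])))
        n += 1
    return out
```

For example, samples taken at times 0, 10 and 20 resampled with a step of 5 give two extra points, at 5 and 15, each halfway between its neighbours.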
posted by scalespace at 8:30 PM on June 12, 2005