Downloadable sets of data?
October 30, 2006 3:28 AM

DoesThisWebsiteExistFilter: Are there any sites that have downloadable sets of data in easily-parseable formats? For example, a list of all US presidents? Or the name of every country's capital? And then have the data available in different formats (XML, CSV, or an on-the-fly generated SQL Insert query)?

I hope I asked that question correctly. Basically, I'd like a site where I can download sets of data in the aforementioned formats, for inclusion in other sites, databases, or excel spreadsheets, or wherever you need a complete set of data.

If this doesn't exist, I think I've just found myself a lil coding project :)
posted by slater to Computers & Internet (22 answers total) 4 users marked this as a favorite
 
For any given piece of data, about the only answer is "use search engines and hope". I seriously doubt that there's any single site, or organized group of sites, which exist with the mission of putting together the kind of organized data you're talking about in any kind of standard format suitable for reading into database programs.
posted by Steven C. Den Beste at 3:46 AM on October 30, 2006


Mmm... wikipedia's use of templates means that the code to screen-scrape a list from it should be relatively simple. But figuring out what the list means... much harder.

It sounds like "toolset to parse structured wikipedia data" will probably get you maximum utility for minimum effort.
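
A rough sketch of the "relatively simple" part, using only Python's standard-library html.parser. The sample markup and the helper name extract_list_items are made up for illustration; real Wikipedia pages are messier, and this does nothing about the harder problem of working out what a list means.

```python
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collect the visible text of every top-level <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0   # how many <li> elements we are currently inside
        self._buf = []    # text fragments of the current top-level item

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.items.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buf.append(data)


def extract_list_items(html):
    """Return the text of each list item found in an HTML fragment."""
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical markup, standing in for a scraped page.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Bern">Bern</a></li>
  <li><a href="/wiki/Paris">Paris</a></li>
  <li><a href="/wiki/Rome">Rome</a></li>
</ul>
"""

print(extract_list_items(SAMPLE_HTML))
```

Text from nested lists is folded into the enclosing item, which is one of many places a real scraper would need more care.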
posted by Leon at 4:27 AM on October 30, 2006


It's a really weird question to ask.

People generally don't think in terms of "hey, I'm going to make a whole bunch of DATA available!" ... people make their data available.

There are hundreds of thousands - maybe millions - of sites out there with XML or CSV files containing datasets. Most of them are datasets that wouldn't be remotely interesting to anyone.

For datasets that might be remotely interesting to at least a moderately large subset of people, I'd try census.gov, usgs.gov, noaa.gov.
posted by dmd at 4:40 AM on October 30, 2006


People actually get paid good money to cull data for any given set.

Feel free to email me if you want any specific help. I am googling for practice database and the like, because students must have this stuff for comp sci classes, but nothing's turning up.
posted by shownomercy at 4:48 AM on October 30, 2006


Response by poster: dmd: i'd just like to make stuff like that available. Why SHOULDN'T it be available in different formats? If there is no such site, I'll do it, make it possible for users to upload their own datasets, and make it all available under some open source license or Creative Commons or some-such.

shownomercy: It's funny you should mention students, the question a friend of mine originally asked was "are there any open-source databases of french verbs".
posted by slater at 5:31 AM on October 30, 2006


Sounds like you came up with a possible business idea, get cracking now and you could make millions : )
posted by crewshell at 5:50 AM on October 30, 2006


I have not seen this done before. When I need this kind of data, I search Google for something like "iso country codes csv", and then find someone who just has that one thing. Does the job most of the time.
posted by smackfu at 6:08 AM on October 30, 2006


Ah, french verbs are basically available: http://machaut.uchicago.edu/

Be aware of possible liability problems here; you don't know where people are getting their data from. Pretty easy to find and copy copyrighted info.
posted by shownomercy at 6:23 AM on October 30, 2006


Oh man sorry about that link http://machaut.uchicago.edu/ haha.
posted by shownomercy at 6:24 AM on October 30, 2006


Response by poster: smackfu: wouldn't it be great if you had it all on one site? ;)

shownomercy: Yeh, i'd obviously have to have some kind of flagging option like on the blue...
posted by slater at 6:34 AM on October 30, 2006


Baseball Databank -- nice big db of easily understandable (but fairly rich and complex) baseball statistical information.
posted by fet at 8:39 AM on October 30, 2006


There are definitely sites that make such information available (notably U.S. Gov sources such as the USGS), but keep your eyes open for a forthcoming project from IBM's visual communication lab called "Many Eyes", which aims to be a sort of Wikipedia / clearing house for visual expressions of user-submitted data sets. Think gazetteer, census, BLS type stuff.
posted by migurski at 10:49 AM on October 30, 2006


Microformats and the semantic web are two ways people are trying to make this sort of thing more common by providing standardized formats like RDF. Follow those links for examples, including a project to add machine-readable information to Wikipedia's software.
posted by mbrubeck at 11:48 AM on October 30, 2006


Google Base is Google's project for publishing machine-readable and queryable datasets.
posted by mbrubeck at 11:50 AM on October 30, 2006


Best answer: 1. There are too many different kinds of data for any one site to hope to encompass them all.

2. Most data isn't static. Somebody has to keep it up to date. This is a lot of work. A site that just trusts random people to 'upload their datasets' is going to contain a lot of out-of-date, overlapping, and just plain incorrect information, reducing its usefulness to, well, not very.

3. What format a given dataset comes in is pretty much a red herring; the important part is the data itself. Simple data is trivial to convert from one format to another, so this type of site wouldn't be needed. Structuring complex data requires making assumptions about how the data will be used, so this type of site wouldn't be feasible. (There are a zillion different ways to represent a given set of data in XML; it depends on the schema you want to use. Same thing for SQL statements; there it depends on the table structures. Sure, you could arbitrarily pick one SQL table description, or make up your own XML schema for each dataset, but most of the time people will have to convert from that to what they really need anyway.) So there really isn't much added value in a site that offers arbitrary data in multiple formats.
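
To make point 3 concrete, here is a minimal sketch of a "simple data" conversion, CSV to SQL INSERT statements, in Python. The table name, the sample rows, and the helpers sql_literal and csv_to_inserts are assumptions invented for this example. Note the built-in guesses: every column is treated as text, and the column names come straight from the CSV header. Those are exactly the table-structure assumptions described above.

```python
import csv
import io


def sql_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def csv_to_inserts(csv_text, table):
    """Turn CSV text (header row first) into one INSERT statement per row.

    Assumes every column is text and that the header names are safe to use
    as column names; neither the table nor the columns are validated.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames
    columns = ", ".join(names)
    statements = []
    for row in reader:
        values = ", ".join(sql_literal(row[name]) for name in names)
        statements.append(f"INSERT INTO {table} ({columns}) VALUES ({values});")
    return statements


# Hypothetical sample data.
SAMPLE_CSV = "code,name\nCH,Switzerland\nCI,Cote d'Ivoire\n"

for statement in csv_to_inserts(SAMPLE_CSV, "countries"):
    print(statement)
```

This only generates text for a one-off import. Code that talks to a live database would normally pass values as query parameters instead of building literals by hand.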


To sum this up: if I need a particular dataset, I'm going to go to a source I can trust to be accurate, which is going to be a source that specializes in maintaining that particular dataset (or for cases like US census data, to the canonical source for that dataset). I'm not going to go to a site that contains a hodgepodge of randomly collected data of dubious provenance and legality and hope they happen to have something that resembles what I'm looking for.
posted by ook at 12:14 PM on October 30, 2006


Best answer: Great idea slater. even if this was just limited to simple sets of data - it would rock. e.g. How many people have written some ecommerce thingy and manually typed in all 50 US states?

It could also have country codes, states, provinces, phone area codes, zip codes...

i'd love to see the data pumped out in different formats. e.g. XML, YAML, c/java/PHP arrays, etc.
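
As a hedged illustration of "pumped out in different formats", the sketch below renders one small list as XML and as a PHP array literal, in Python. The sample list is deliberately partial, and the function names, tag names, and variable name are all made up for the example; a real site would have to pick its own conventions.

```python
import xml.etree.ElementTree as ET

# Deliberately partial sample, not a complete list of cantons.
CANTONS = ["Bern", "Geneva", "Zurich"]


def to_xml(items, root_tag, item_tag):
    """Render a flat list as one XML element per item."""
    root = ET.Element(root_tag)
    for item in items:
        ET.SubElement(root, item_tag).text = item
    return ET.tostring(root, encoding="unicode")


def to_php_array(items, variable):
    """Render a flat list as a PHP array assignment with single-quoted strings."""
    quoted = ", ".join(
        "'" + item.replace("\\", "\\\\").replace("'", "\\'") + "'"
        for item in items
    )
    return f"${variable} = array({quoted});"


print(to_xml(CANTONS, "cantons", "canton"))
print(to_php_array(CANTONS, "cantons"))
```

Once the list lives in one place, each extra output format is a few lines, at least for flat data like this.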
posted by kamelhoecker at 1:23 PM on October 30, 2006


Enormous amounts of data are freely downloadable from IMDB.

Essentially their entire database of movies, actors, actresses, crew and so on -- but not the tables that link it all together -- can be downloaded via FTP.
posted by AmbroseChapel at 3:32 PM on October 30, 2006


Response by poster: ook: Yes, trust would be a big problem, and one I'm well aware of. Obviously, it would require some kind of rating system and an admin (me) overseeing all that and making sure people aren't gaming the system by making duplicate accounts and giving a joke list of 51 US states a 10/10 rating.

kamelhoecker: That's precisely the reason I want to do it. I've sat many a time trying to find all the US States, or all cantons in Switzerland, and for such data that changes maybe once a millennium, this might work.

To everyone offering links to available data:
Thank you, but I'm not sure if you understand: i *know* the data is available, but it's either some niche (baseball stats?!) or just too much (IMDB). I'm looking into offering SMALL amounts of data (list of US states, list of European countries, etc.), not world census data that is, theoretically, always changing.
posted by slater at 9:14 PM on October 30, 2006


I totally don't understand the question now.

You want to offer only small, noon-changing sets of data (the list of 'countries in Europe' never changes!) -- what's the point of the exercise? What's the use case where someone says "I need X" and someone else says "go to slater's site, they're sure to have it"? What does the potential user want the data for?
posted by AmbroseChapel at 12:58 AM on October 31, 2006


Oops -- "non" changing.
posted by AmbroseChapel at 12:59 AM on October 31, 2006


Response by poster: OK, most data will change over time, but obviously the site will be more successful with SLOW-changing data. Another example would be a list of all US presidents up until now. Maybe the site should have an easy way of updating existing content.

And the potential user wants the data for what I mentioned in the initial question:

"[F]or inclusion in other sites, databases, or excel spreadsheets" etc.
posted by slater at 9:02 AM on October 31, 2006


Either I'm short of sleep/vitamins/something else, or this question still doesn't make sense.

Or rather, you have some entirely arbitrary category of "information which doesn't change very much, very often, and which isn't very long and isn't very detailed and isn't very obscure" which makes sense to you, but which you don't seem to be able to communicate too well to others. Like porn, you don't know what the definition is, but you know it when you see it.

People offer you sources of information, but there's too much of it, or it's too obscure.

Lists of states, countries and presidents would apparently qualify, despite the fact that the president is going to change shortly, and the information from IMDB (which would contain for instance, all Best Movie Oscars) doesn't qualify, but you don't say why.

And you've given "best answer" to two different people, one of whom thinks it's a bad idea and the other of whom thinks it's a great idea.

I respectfully submit that your project needs re-thinking.
posted by AmbroseChapel at 9:44 PM on October 31, 2006

