Mashing up government data
May 29, 2008 2:31 PM

I'm trying to lead a crusade for government to publish its statistics and data in a way that is mashable. Is it possible to define a standard digital format that could apply to a diverse array of data sets?

The sort of examples for which this might apply are crime statistics, bus and train travel times, recent house sales, mapping data etc.

If it were possible to define a standard, what would it be and how could you describe it to people without an IT background?
posted by baggymp to Technology (12 answers total) 5 users marked this as a favorite
For mapping stuff, I recommend getting in touch with the Open Geospatial Consortium. From their site: "The Open Geospatial Consortium, Inc.® (OGC) is a non-profit, international, voluntary consensus standards organization that is leading the development of standards for geospatial and location based services."
posted by desjardins at 2:54 PM on May 29, 2008

Also, maybe I'm out of my league here, but I'm not sure what you mean by "standard digital format" other than something like a CSV file (and what's wrong with that?). Obviously each dataset would contain differing information, and different regions would have specific needs. Stuff like geocoding should ideally all be in the same format, but then again you run into projection/datum issues when you're talking about a country the size of the US. I see you're in London so it's less of a problem in the UK, but still.
posted by desjardins at 2:58 PM on May 29, 2008

but I'm not sure what you mean by "standard digital format" other than something like a CSV file (and what's wrong with that?)

Or, you know, XML. Kinda what it is designed to do: be self-describing data.
posted by trinity8-director at 3:11 PM on May 29, 2008

OK, let me clarify for the OP.

XML is a way of publishing arbitrary data so that unknown clients can understand its structure via data descriptions called DTDs. This is what you are looking for. There are plenty of XML-related resources around: online, at the bookstore, at the library. It's not terribly complicated to understand, either.

Having the gov't expose their data as XML is possible as more and more databases support XML. Whether any particular department is running an XML-capable database is another question.

Naturally, there will be concerns about personal/confidential data that will need to be addressed.
posted by trinity8-director at 3:18 PM on May 29, 2008

XML doesn't really solve the problem; all it is is a concrete syntax and a semantic-free grammar. You still need to define the data format on top of that.
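
To make that concrete, here is a minimal Python sketch. The feeds, element names and numbers are all invented for illustration. Both documents are well-formed XML with the same structure, and the same code reads both, yet nothing in the markup says whether the number is a density or a raw count.

```python
import xml.etree.ElementTree as ET

# Hypothetical feeds: element names and values are invented for illustration.
# Publisher A means "pixies per square mile"; publisher B means "total pixies".
FEED_A = "<sightings><area name='Camden'><pixies>12</pixies></area></sightings>"
FEED_B = "<sightings><area name='Camden'><pixies>96</pixies></area></sightings>"


def read_pixies(xml_text):
    """Return {area name: number} exactly as written in the file."""
    root = ET.fromstring(xml_text)
    return {
        area.get("name"): float(area.findtext("pixies"))
        for area in root.iter("area")
    }


a = read_pixies(FEED_A)
b = read_pixies(FEED_B)
# Both parse cleanly, but the parser cannot tell us the units differ;
# that knowledge has to come from a format definition on top of the XML.
print(a, b)
```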

There are public standards in use for publishing spatially-referenced data, such as the USGS' Spatial Data Transfer Standard. The hard part, if you want to make a mashup of some random thing, is usually going to be collecting the data from multiple disparate sources and putting it into a common form without introducing artifacts and biases— maybe this data set is pixies per square mile, and that one is pixies per county, so you'll need to reference it to other mapping data so you can compare; a third data set lumps pixies into "supernatural beings, small", and a fourth one only includes pixies sighted during daytime by registered notaries… compared to all that, doing some file format conversions is a minor issue.
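
A rough Python sketch of the "per county" versus "per square mile" step, assuming you already have trustworthy county areas from some other mapping data set. The county names and figures here are made up.

```python
# Hypothetical reference data: county areas in square miles (made-up values).
COUNTY_AREA_SQ_MI = {"Northshire": 400.0, "Southshire": 250.0}

# Data set 1 already reports a density; data set 2 reports a raw count per county.
per_sq_mile = {"Northshire": 0.05}
per_county = {"Southshire": 10}


def to_density(counts, areas):
    """Convert per-county counts to pixies per square mile."""
    return {county: n / areas[county] for county, n in counts.items()}


# Only after both are in the same unit does it make sense to combine them.
combined = {**per_sq_mile, **to_density(per_county, COUNTY_AREA_SQ_MI)}
print(combined)
```

This only handles the unit mismatch. The category and sampling differences (the "supernatural beings, small" bucket, daytime-only sightings) cannot be fixed by arithmetic and need a decision by whoever combines the data.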
posted by hattifattener at 3:43 PM on May 29, 2008

You still need to define the data format on top of that.

Well, somebody would, yes. That's the DTD. Typically that falls to the people who publish the data. The DTD describes what the data structure is.

Unless you mean the descriptions of what the data means? That's another, separate issue. You can't rely on field names from the database because many times those are just nonsense that only DBAs can appreciate.

Still, descriptions could be added as comments in the DTD but there is no standard way of doing (read: enforcing) that.
posted by trinity8-director at 4:00 PM on May 29, 2008

You might want to check out mailing lists on Check out the archives and you'll find a few other people interested in doing this sort of mashup work. It's a really friendly group always looking for people doing cool things.
posted by eisenkr at 4:06 PM on May 29, 2008

T8d: The DTD or schema just describes the grammar, that is, the structure of the file, which at best only generally approximates the structure of the data. I've spent far too much time trying to extract information from XML files given only a DTD or schema and some comments to be under the illusion that XML is more than a minor part of the task of defining a data interchange format.
posted by hattifattener at 4:30 PM on May 29, 2008

You could also take a look at the netCDF libraries. The libraries encompass several formats, one common one being HDF5, and have been used in many different applications-- the Wikipedia article I linked mentions climatology and GIS, but I first encountered them in a neuroimaging analysis package. It seems like it has a few differences from XML that make it more suited for larger datasets. For starters, it's a binary format, and it can store somewhat regular data in a more structured and less space-wasting way. It also supports better indexing into the middle of the file. With XML, to be completely sure what you are getting, you need to parse the file up to a certain point.
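
To illustrate just the indexing point, here is a small Python sketch. It does not use the netCDF or HDF5 libraries at all; it uses a made-up fixed-width record layout and the standard `struct` module, only to show why a regular binary layout lets you jump to the middle of a file.

```python
import io
import struct

# Hypothetical fixed-width record: a 32-bit int id plus a 64-bit float value.
# "<" means little-endian with no padding, so every record is 12 bytes.
RECORD = struct.Struct("<id")


def write_records(stream, rows):
    """Append (id, value) pairs to the stream as fixed-width records."""
    for row_id, value in rows:
        stream.write(RECORD.pack(row_id, value))


def read_record(stream, index):
    """Jump straight to record `index` without reading the ones before it."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))


buf = io.BytesIO()
write_records(buf, [(i, i * 1.5) for i in range(1000)])
print(read_record(buf, 742))
```

Because every record has the same size, the position of record 742 is simple arithmetic. With a text format like XML you would generally have to parse from the start to find it.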

Many of the same caveats of XML files still apply though, because even though the file format is well-defined, the data and the interpretation of the data are not.

I guess that HDF5 and XML can both be described as a box that you can place other labelled boxes into, and that you can put boxes into _those_ boxes and so on. But the problems start to happen when you trade your useful dataset with your friend. Maybe you look everywhere through your friend's box for the "gender" box, until you *smack head* and realize that you need to look for the "sex" box. Maybe you "mashup" your data with your friend's by dumping your "phone numbers" box into his "phone numbers" box, only to realize that neither of you wrote the area codes into the phone numbers and that they were from two different area codes. Maybe his box labelled "phone numbers" actually holds SSNs and is mislabelled for "security" purposes. Basically, with a flexible file format, there is no end to the ways that data can get bollixed up and misinterpreted.
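
A minimal Python sketch of the first two problems, assuming both sides have agreed on a shared vocabulary and each knows its own area code. The field names, alias table and phone numbers are hypothetical.

```python
# Hypothetical mapping from each publisher's labels to one agreed vocabulary.
FIELD_ALIASES = {
    "gender": "sex",
    "sex": "sex",
    "phone": "phone",
    "phone numbers": "phone",
}


def normalise(record, default_area_code):
    """Rename fields to the shared vocabulary and prefix bare phone numbers."""
    out = {}
    for key, value in record.items():
        name = FIELD_ALIASES.get(key.lower())
        if name is None:
            # Refuse to guess what an unknown label means.
            raise KeyError(f"no agreed meaning for field {key!r}")
        out[name] = value
    # Assumes a bare local number is exactly 7 digits long.
    if "phone" in out and len(out["phone"]) == 7:
        out["phone"] = default_area_code + out["phone"]
    return out


mine = normalise({"gender": "F", "phone": "5550100"}, default_area_code="212")
theirs = normalise({"Sex": "M", "Phone Numbers": "5550199"}, default_area_code="415")
print(mine, theirs)
```

Note what this cannot do: if the "phone numbers" box really holds SSNs, the code will happily process them as phone numbers. Only documentation from the publisher catches that.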
posted by Maxwell_Smart at 6:27 PM on May 29, 2008

For all the reasons mentioned above (not to mention other inherent challenges in interpreting a lot of the stuff that goes through government/administrative databases and the quality control issues there) this is a worthwhile but thorny pursuit. But more directly to the point of your question, you might find some good resources here.
posted by shelbaroo at 7:01 PM on May 29, 2008

I'm sure there's a website that plots crime data. They invite local police departments to send them the information. They claim it should be easy because they accept the data in the same format that the departments are already required to use when reporting crime to the FBI. Naturally, I can't find that website. Sorry. Anyway...

Here's someone else who is already mashing up government data: EveryBlock. And here's a city offering up vast amounts of data. NYC OPS.
posted by stuart_s at 7:50 PM on May 29, 2008

Tom: I think the key to publishing any such data (and good luck with that) is not in the format, provided that the data isn't locked up in some annoying binary file type.

Far more useful to those of us who'd like to use the data is that you publish each stream of data in a format which:

a) does not change,
b) explicitly makes any revisions available (just republishing with a different filename would achieve that), and
c) exists at a URL which a computer can reliably guess.

Under c) you could get your recalcitrant quangos to either post "2008-05-30/", adjust the revision numbers and dates daily/more-than-daily, and keep them all in a known directory. Or make it known that a particular RSS feed will contain nothing but those files as enclosures, in the same way podcasting works. Or something similar.
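
As a sketch of what "a URL which a computer can reliably guess" might look like in practice, here is a short Python example. The host name, directory layout and revision scheme are assumptions for illustration, not anything a real department publishes.

```python
from datetime import date

# Hypothetical layout: the host, dataset name and revision scheme are assumptions.
BASE_URL = "https://data.example.gov.uk"


def release_url(dataset, day, revision=1):
    """Build the URL for one day's release of a dataset."""
    return f"{BASE_URL}/{dataset}/{day.isoformat()}/r{revision}.csv"


# A client can work out where today's file (or any revision) lives
# without a human looking at a web page first.
print(release_url("crime", date(2008, 5, 30)))
print(release_url("crime", date(2008, 5, 30), revision=2))
```

The exact pattern matters much less than the promise that it will not change once people start building on it.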

But if you want people to make use of the data, the metadata and the predictability are what enable people to build things without having to manually review each released file.
posted by genghis at 6:25 AM on May 30, 2008
