I don't work for the government, promise.
June 11, 2009 6:32 PM
I'm trying to map protests in the United States, but I'm grappling with data sources (and will eventually tangle with data management). Any ideas?
I'd like to map out protests, riots, bombings, and other cheerful social outings - ideally in the United States, where I have the most contextual knowledge, but that's not a necessity.
My original plan was to scrape AP's US news RSS feed, store everything in some sort of XML database, and then query that for what I need. I just checked their RSS format, and it unfortunately doesn't include the full article. Nor does it include a separate tag for the location, which would make geocoding a bit/much nastier. NYT's feeds are basically the same story. I don't really know where to go from here.
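One possible workaround for the missing location tag, sketched in Python: pull the basic fields out of each RSS item and guess the place from a wire-style dateline at the start of the description. The sample feed, the `parse_items` helper, and the dateline pattern are all assumptions made for illustration; the real AP and NYT feeds may not carry a dateline in the description at all.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feed; real AP/NYT feeds may be structured differently.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example US News</title>
    <item>
      <title>Hundreds march on city hall</title>
      <link>http://example.com/story/1</link>
      <pubDate>Thu, 11 Jun 2009 18:32:00 GMT</pubDate>
      <description>CHICAGO (AP) - Hundreds of demonstrators marched downtown.</description>
    </item>
    <item>
      <title>Markets close higher</title>
      <link>http://example.com/story/2</link>
      <pubDate>Thu, 11 Jun 2009 17:00:00 GMT</pubDate>
      <description>Stocks rose on Thursday.</description>
    </item>
  </channel>
</rss>"""

# Assumed dateline shape: an all-caps place name followed by "(AP)".
DATELINE = re.compile(r"^\s*([A-Z][A-Z .,'-]+?)\s*\(AP\)")


def parse_items(rss_text):
    """Return one dict per <item>, with a best-guess place or None."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        match = DATELINE.match(description)
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            # None means no dateline was found, so geocoding needs another cue.
            "place": match.group(1).strip() if match else None,
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_RSS):
        print(entry["published"], entry["place"], entry["title"])
```

Items that come back with `place` set to `None` would still need the full article text, or a human, to locate them.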
There are basically five steps, and I would love advice on any:
1. Scrape database of news articles.
2. Store in a format that would allow querying by date or location. I'd like to keep all the articles, too, because... really, that would be an awesome dataset.
3. Tag protests (method: NLP, Mech Turk, or caffeinated McB).
4. Tag with date and location.
5. Make pretty maps.
Step 6 is going crazy with spatial stats, but I've got that part covered. I've been letting this project fester for too long, and it is now certifiably brain crack. Any advice on 1-5 would be greatly appreciated.
Aside: I really have thought about the ethical consequences of this. If you're concerned, MeFiMail me and I'll do my best to assuage your doubts.
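For step 2, a minimal sketch of one storage option, using SQLite from Python's standard library rather than the XML database mentioned above. The table layout and the helper names (`open_store`, `add_article`, `find_articles`) are hypothetical; the point is only that keeping every article in one table with a date column and coordinates makes both kinds of query straightforward.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the article store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            title TEXT,
            body TEXT,
            published TEXT,     -- ISO 8601 date, e.g. 2009-06-11
            place TEXT,
            lat REAL,
            lon REAL,
            is_protest INTEGER  -- NULL until someone (or something) tags it
        )""")
    return conn


def add_article(conn, url, title, body, published,
                place=None, lat=None, lon=None):
    """Insert an article; a URL that is already stored is skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO articles "
        "(url, title, body, published, place, lat, lon) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (url, title, body, published, place, lat, lon),
    )
    conn.commit()


def find_articles(conn, start, end, bbox=None):
    """Query by date range and, optionally, by a lat/lon bounding box.

    bbox is (min_lat, min_lon, max_lat, max_lon).
    """
    sql = ("SELECT title, published, place FROM articles "
           "WHERE published BETWEEN ? AND ?")
    params = [start, end]
    if bbox is not None:
        min_lat, min_lon, max_lat, max_lon = bbox
        sql += " AND lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?"
        params += [min_lat, max_lat, min_lon, max_lon]
    sql += " ORDER BY published"
    return conn.execute(sql, params).fetchall()
```

Storing dates as ISO 8601 text keeps them sortable with plain string comparison, which is why `BETWEEN` works here without any date parsing.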
Regarding data sources: you need to aggregate many sources, not just one or two news feeds. You may also wish to monitor things like press releases from political organizations inclined to protest and controversy, such as PETA and labor unions.
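A rough sketch of what aggregating could look like once several feeds or press-release pages have been fetched: merge the items and collapse stories that share a headline, while remembering which sources carried each one. The `merge_feeds` helper, the input shape, and matching on a normalized title are assumptions chosen for brevity; real duplicates often differ by more than punctuation.

```python
import re


def normalize(title):
    """Lowercase a headline and collapse punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_feeds(feeds):
    """Merge items from several sources, grouping those with the same headline.

    feeds maps a source name to a list of dicts with "title" and "link" keys.
    """
    merged = {}
    for source, items in feeds.items():
        for item in items:
            key = normalize(item["title"])
            entry = merged.setdefault(
                key, {"title": item["title"], "links": [], "sources": []}
            )
            entry["links"].append(item["link"])
            entry["sources"].append(source)
    return list(merged.values())
```

The number of sources per story is also a crude signal of how widely an event was covered.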
posted by lester at 7:26 PM on June 11, 2009
"Protest" is a pretty vacuous term to some degree, riots and bombings less so. I'd narrow down my definition some to weed out insignificant "noise."
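One crude way to narrow the definition, sketched in Python: score each article against a weighted word list and keep only those above a threshold. The terms, the weights, and the threshold below are invented for illustration and would need tuning against real articles; this is plain keyword matching, not proper NLP.

```python
import re

# Hypothetical weights: unambiguous terms score higher than vague ones.
EVENT_TERMS = {
    "riot": 3,
    "bombing": 3,
    "protesters": 2,
    "demonstrators": 2,
    "protest": 1,
    "march": 1,
    "rally": 1,
}


def score(text):
    """Sum the weights of every listed term that appears as a whole word."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(EVENT_TERMS.get(word, 0) for word in words)


def is_event(text, threshold=3):
    """Treat an article as a candidate event only at or above the threshold."""
    return score(text) >= threshold
```

With these weights, a passing mention of "protest" in an unrelated story scores 1 and is dropped, while a story mentioning protesters and a march reaches the threshold.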
I think you're ahead of the curve on your data gathering to the point where it's taking you down a blind alley. Someday a smart implementation of RSS (or whatever) will be able to deliver results like you're seeking, but it ain't there yet.
proj is on the right track - there are lots of sources out there already cataloging this type of information. Maybe not geocoding it or doing other cool stuff with it, but collecting it nonetheless. Are you at a university? If so, fire off some emails to anyone who might provide useful direction.
posted by wfrgms at 10:10 PM on June 11, 2009
Response by poster: Thanks for all the feedback so far - I'm barely awake and will write up a proper response when I get a chance, but I just wanted to clarify that this is mostly a technical exercise. The precursor to this project was done in an academic context, used a publicly available dataset of protests, addressed the social movements lit on the ambiguity of the term "protest", etc. Right now, I'm more interested in getting the tech skills to pull something like this off than in getting dissertation-quality results.
Thank you, though, and carry on!
posted by McBearclaw at 11:04 PM on June 11, 2009
You could check whether the articles come with photographs and, if they do, download them and check for XMP or EXIF data. I read a while ago about a newspaper that published a photograph of an anonymous source, and the photo's metadata said where it was taken. Oops.
Another option might be to look at the protesters' website, particularly if there's one group behind several protests, as they would want to publicise the protest and its location etc. Of course, you don't want to bias your data to over-represent web-savvy protesters.
posted by Mike1024 at 12:27 AM on June 12, 2009
This thread is closed to new comments.
posted by proj at 6:59 PM on June 11, 2009