Mining news sites for data.
January 23, 2004 5:02 PM

Is there a way, without constant human intervention, to (1) mine either Google News, Yahoo News, or the AP for new obituaries and (2) drop the name, age, blurb, and URL into a database?

I've pondered this for a while. A really crude way would be to search headlines for ", [0-9][0-9], " and " dies at [0-9][0-9]." But I'm not sure this would pick up everything. For example, if I search Google News for "kangaroo", only two of about 20 links identify Bob Keeshan's name, the reason for his fame, and his age. Most say simply "Captain Kangaroo Dies". And only the NYT headline has all the data elements separated by commas (and it is probably not consistent on that point from one obit to the next).

Any cleaner ideas?
posted by PrinceValium to Computers & Internet (5 answers total)
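For reference, the crude headline search described in the question might look roughly like the Python sketch below. The function name parse_headline and the sample headlines in the usage note are hypothetical, and the age pattern is loosened to one to three digits, which is an assumption beyond the two-digit patterns in the question.

```python
import re

# Comma-delimited style: "Name, 84, blurb".
COMMA_STYLE = re.compile(r"^(?P<name>[^,]+), (?P<age>\d{1,3}), (?P<blurb>.+)$")

# "Name dies at 84" style, matched case-insensitively.
DIES_AT_STYLE = re.compile(
    r"^(?P<name>.+?) dies at (?P<age>\d{1,3})\b", re.IGNORECASE
)


def parse_headline(headline):
    """Return a dict with name, age, and blurb, or None if nothing matches."""
    text = headline.strip()

    match = COMMA_STYLE.match(text)
    if match:
        return {
            "name": match.group("name"),
            "age": int(match.group("age")),
            "blurb": match.group("blurb"),
        }

    match = DIES_AT_STYLE.match(text)
    if match:
        # This headline style carries no separate blurb.
        return {
            "name": match.group("name"),
            "age": int(match.group("age")),
            "blurb": None,
        }

    # Headlines such as "Captain Kangaroo Dies" fall through here.
    return None
```

A made-up headline like "Jane Doe, 84, Stage Actress, Dies" is picked up by the first pattern, while "Captain Kangaroo Dies" returns None, which is exactly the coverage gap the question worries about.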
 
Might be easier to use Celebrity Death Beeper or any number of similar services.
posted by oissubke at 5:19 PM on January 23, 2004


And there's also the Blog of Death.
posted by davidmsc at 6:37 PM on January 23, 2004


Google does not take kindly to automated mining.
posted by srboisvert at 8:48 PM on January 23, 2004


Response by poster: "Google does not take kindly to automated mining."

Isn't automated mining all that Google does?
posted by PrinceValium at 6:17 AM on January 24, 2004


Use this and your favourite XML parser. Then take the first string of capitalized words as the name and the next bunch as either a blurb or an age. You could probably do it in 20 minutes in Perl.
posted by cmonkey at 10:49 AM on January 24, 2004
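A rough Python sketch of the approach in the answer above, rather than Perl. It assumes the feed is RSS 2.0 with item, title, and link elements, which the thread does not confirm. SAMPLE_FEED, split_title, load_feed, and the obits table are all hypothetical names made up for illustration, and SQLite stands in for whatever database would actually be used.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed's element names may differ.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Jane Doe, 84, stage actress</title><link>http://example.com/a</link></item>
<item><title>John Roe dies at 91</title><link>http://example.com/b</link></item>
</channel></rss>"""

# The leading run of capitalized words is treated as the name.
NAME_RUN = re.compile(r"^((?:[A-Z][\w.'-]*\s?)+)")


def split_title(title):
    """Split a title into (name, age, blurb); age and blurb may be None."""
    match = NAME_RUN.match(title)
    if not match:
        return None
    name = match.group(1).strip()
    rest = title[match.end():].strip(" ,")

    # The first standalone number in the remainder is taken as the age.
    age_match = re.search(r"\b(\d{1,3})\b", rest)
    if age_match:
        age = int(age_match.group(1))
        rest = (rest[:age_match.start()] + rest[age_match.end():]).strip(" ,")
    else:
        age = None
    return name, age, rest or None


def load_feed(xml_text, conn):
    """Parse the feed and store one row per item, skipping URLs already seen."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS obits "
        "(name TEXT, age INTEGER, blurb TEXT, url TEXT UNIQUE)"
    )
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        parsed = split_title(item.findtext("title", default=""))
        if parsed is None:
            continue
        name, age, blurb = parsed
        url = item.findtext("link", default="")
        conn.execute(
            "INSERT OR IGNORE INTO obits VALUES (?, ?, ?, ?)",
            (name, age, blurb, url),
        )
    conn.commit()
```

Calling load_feed(SAMPLE_FEED, sqlite3.connect(":memory:")) fills the table from the sample. The heuristic is as crude as it sounds: a title like "Captain Kangaroo Dies" would be stored whole as the name, with no age or blurb.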

