Mining news sites for data.
January 23, 2004 5:02 PM
Is there a way, without constant human intervention, to (1) mine either Google News, Yahoo News, or the AP for new obituaries and (2) drop the name, age, blurb, and URL into a database?
I've pondered this for a while. A really crude way would be to search headlines for ", [0-9][0-9], " and " dies at [0-9][0-9]." But I'm not sure this would pick up everything. For example, if I search Google News for "kangaroo", only two of the roughly 20 links identify Bob Keeshan's name, the reason for his fame, and his age. Most say simply "Captain Kangaroo Dies". And only the NYT headline has all the data elements separated by commas (and the NYT is probably not consistent about that from one obit to the next).
Any cleaner ideas?
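For reference, the crude headline search described above might look something like the following Python sketch. The function name `parse_headline`, the sample headlines in the usage note, and the choice to accept one to three digits for the age (rather than exactly two) are assumptions for illustration, not something any news site guarantees.

```python
import re

# Pattern 1: "Name, NN, blurb" (age set off by commas).
# Assumption: \d{1,3} instead of exactly two digits, so ages like 9 or 101 match.
AGE_BETWEEN_COMMAS = re.compile(
    r"^(?P<name>[^,]+), (?P<age>\d{1,3}), (?P<blurb>.+)$"
)

# Pattern 2: "Name dies at NN" (case-insensitive, optional comma before "dies").
DIES_AT = re.compile(
    r"^(?P<name>.+?),? dies at (?P<age>\d{1,3})\b", re.IGNORECASE
)


def parse_headline(headline):
    """Return a dict with name, age, and blurb, or None if neither pattern fits."""
    m = AGE_BETWEEN_COMMAS.match(headline)
    if m:
        return {
            "name": m.group("name").strip(),
            "age": int(m.group("age")),
            "blurb": m.group("blurb").strip(),
        }
    m = DIES_AT.match(headline)
    if m:
        # This pattern carries no separate blurb, so leave it empty.
        return {"name": m.group("name").strip(), "age": int(m.group("age")), "blurb": ""}
    return None
```

With made-up headlines, `parse_headline("Jane Example, 84, Inventor of the Widget")` yields a name, age, and blurb, while `parse_headline("Captain Kangaroo Dies")` returns `None`, which is exactly the coverage problem described above.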
Google does not take kindly to automated mining.
posted by srboisvert at 8:48 PM on January 23, 2004
Response by poster: "Google does not take kindly to automated mining."
Isn't automated mining all that Google does?
posted by PrinceValium at 6:17 AM on January 24, 2004
Use this and your favourite XML parser. Then take the first string of capitalized words as the name and the next bunch as either a blurb or an age. You could probably do it in 20 minutes in Perl.
posted by cmonkey at 10:49 AM on January 24, 2004
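A rough Python sketch of the approach in the comment above (XML parser, then the leading capitalized words as the name and whatever follows as age or blurb), with the result dropped into a database. It assumes an RSS 2.0 style feed with `item`, `title`, and `link` elements; the feed layout, the `obits` table, and the helper names `split_title`, `items_from_rss`, and `store` are all hypothetical.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Leading run of capitalized words, then everything else in the title.
LEADING_CAPS = re.compile(r"^((?:[A-Z][\w.'-]*\s*)+)(.*)$", re.DOTALL)


def split_title(title):
    """Split a title into (name, age, blurb); age is None when absent."""
    m = LEADING_CAPS.match(title.strip())
    if not m:
        return None
    name = m.group(1).strip()
    rest = m.group(2).strip(" ,")
    # If the remainder starts with a number, treat it as the age.
    age_match = re.match(r"(\d{1,3})\b[ ,]*(.*)", rest)
    if age_match:
        return name, int(age_match.group(1)), age_match.group(2)
    return name, None, rest


def items_from_rss(xml_text):
    """Yield (title, link) for each <item> in an RSS-like document."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield item.findtext("title", default=""), item.findtext("link", default="")


def store(conn, xml_text):
    """Parse the feed and insert one row per recognizable title."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS obits "
        "(name TEXT, age INTEGER, blurb TEXT, url TEXT UNIQUE)"
    )
    for title, link in items_from_rss(xml_text):
        parsed = split_title(title)
        if parsed:
            # UNIQUE url plus OR IGNORE keeps repeated runs from duplicating rows.
            conn.execute(
                "INSERT OR IGNORE INTO obits VALUES (?, ?, ?, ?)", (*parsed, link)
            )
    conn.commit()
```

Calling `store(sqlite3.connect("obits.db"), feed_text)` on a schedule would cover the "without constant human intervention" part. Note that the heuristic breaks down on Title Case headlines, where every word is capitalized and the whole headline ends up as the name.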
This thread is closed to new comments.
posted by oissubke at 5:19 PM on January 23, 2004