Newspaper Clippings 2.0?
May 7, 2009 3:36 PM   Subscribe

Can I automate archiving/saving news articles on a certain topic that I pull from the Google News RSS feed?

For a while, I was manually copying and saving all the news articles on a certain topic that came in through my Google RSS feed. But it became cumbersome, so I stopped. Now I've looked back at those articles from a few years ago, and I wish I had kept it up. Is there a way to automate something like that? A modern-day newspaper clipping collection, only automated? I don't want to save just the URL, but the actual text of the article, where it is from, the date, and possibly pictures.

This is for my own personal use, so I assume it wouldn't raise any copyright issues.

I did a search, but my Google-fu is failing me. I keep coming up with the Google News archive, but that's not really what I'm looking for. I want my own personal copies. I don't know how the Google News archive works, but I know that some articles I originally got from Google News are not in their archive (I just checked).
posted by [insert clever name here] to Computers & Internet (4 answers total) 3 users marked this as a favorite
 
Set up a Google Alert. Under the delivery options, choose "feed." Does that do what you're looking for?
posted by Jaltcoh at 3:40 PM on May 7, 2009


Response by poster: No. A Google Alert is just the start of it. That's more or less how I was doing it before, but then I'd manually save the stories to a MySQL database and output them in a list. That doesn't have to be the output, but some sort of archiving in flat files or a DB, as my own copy, is what I'm looking for.
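
To give a sense of the shape I mean, here's a rough sketch (using Python's built-in sqlite3 just to keep it self-contained; a MySQL version would be analogous, and the table and column names are only illustrative):

import sqlite3

# One row per clipping: the article text itself plus where and when it ran
conn = sqlite3.connect("clippings.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id       INTEGER PRIMARY KEY,
        title    TEXT,
        source   TEXT,  -- the publication the story came from
        url      TEXT,
        pub_date TEXT,
        body     TEXT   -- the full article text, not just the link
    )
""")

def save_article(title, source, url, pub_date, body):
    conn.execute(
        "INSERT INTO articles (title, source, url, pub_date, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, source, url, pub_date, body))
    conn.commit()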
posted by [insert clever name here] at 4:45 PM on May 7, 2009


Maybe you could set up a cron job to run a Python script that would parse the feed's XML for links to download?

IBM has a Python script for parsing RSS feeds. That script doesn't quite download the files, but you could probably modify it to something like this:
from RSS import ns, TrackingChannel
import urllib
import re

# Fetch and parse the Google News feed
tc = TrackingChannel()
tc.parse("http://news.google.com/?output=rss")

RSS10_TITLE = (ns.rss10, 'title')

items = tc.listItems()
for item in items:
    url = item[0]
    print "RSS Item:", url
    item_data = tc.getItem(item)
    # Use the item title as the filename, with characters that
    # aren't filename-safe replaced so open() doesn't choke
    title = item_data.get(RSS10_TITLE, "(untitled)")
    filename = re.sub(r'[^\w\- ]', '_', title) + ".html"
    # Save the raw bytes of the article page
    newsItem = urllib.urlopen(url)
    savedItem = open(filename, 'wb')
    savedItem.write(newsItem.read())
    newsItem.close()
    savedItem.close()

But as I haven't tested the above, there's no guarantee it'll work right off the bat.
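
If that IBM RSS module is hard to track down, the same idea should work with the feedparser library (Mark Pilgrim's Universal Feed Parser). Equally untested on my end; the feed URL and the filename scheme are just placeholders:

import feedparser
import urllib
import re

# feedparser handles RSS and Atom alike and hides the XML details
feed = feedparser.parse("http://news.google.com/?output=rss")

for entry in feed.entries:
    print "RSS Item:", entry.link
    # Turn the headline into something safe to use as a filename
    filename = re.sub(r'[^\w\- ]', '_', entry.title) + ".html"
    page = urllib.urlopen(entry.link)
    saved = open(filename, 'wb')
    saved.write(page.read())
    page.close()
    saved.close()

Either way, a cron entry pointing at the script once a day would keep the clippings rolling in on their own.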
posted by movicont at 6:07 PM on May 7, 2009


Best answer: There are two ways to do this. One is something that just sucks in everything you've subscribed to. Gregarius is a web application that will let you do this, though there are others that vary in complexity and features.

Another option is to use software like DEVONthink. I have my browser set up to automatically save a local copy of a page into DEVONthink (which is fully searchable and has its own AI engine) with the keystroke Command + 2. (Command + 1 is the "blog this and quote the text I've highlighted" shortcut.)

(DEVONthink can also subscribe to RSS feeds and let you just search everything. I do a combination of the above two methods.)
posted by Brian Puccio at 8:46 PM on May 10, 2009

