Help me populate a spreadsheet by scraping an RSS feed.
April 4, 2012 10:21 AM
I would like to scrape information from an RSS feed into an Excel-readable text file for a completely legal, non-copyright-violating use. In a better world, I'd have access to the database that generates the feed, but since this ain't a perfect world, it appears that scraping is my best bet. Are there tools that will help me automate this, or programming tutorials that will help me figure it out myself (it's been 15 years since I last wrote any code beyond simple SQL queries)?
The XML is formatted thus, for each new post (but with angle brackets where I've put square brackets):
[item]
[title]A title[/title]
[link]http://URL[/link]
[guid isPermaLink="true"]http://URL[/guid]
[description]Description, which may include embedded links and images.
[/description]
[/item]
I'd like to scrape this into an Excel-readable format, where each row consists of:
TITLE, URL (from "guid" Permalink, not from "link"), DESCRIPTION (First 50 characters, don't need links or images).
In an even more ideal world, I'd be able to do this in a smart enough manner that if I scrape the feed every day my software/widget/whatever tool can distinguish new content and only scrape that.
I know this is possible, would be super easy for the right programmer, and that without any help I could probably even cobble something together in a month or two. But I'm a writer without access to the "right programmer," and I'd really prefer not to take 1-2 months to try to figure it out.
posted by croutonsupafreak to computers & internet (9 answers total)
1. Load up Beautiful Soup.
2. Download the RSS file.
3. Parse the RSS file with Beautiful Soup.
4. Iterate over the items.
5. For each item, check whether the guid is already in the database, and discard the item if it is.
6. Write the remaining items to the database, truncating the description text.
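The six steps above can be sketched roughly as follows. This is only a minimal sketch, not a finished tool, and it makes a few substitutions: it uses Python's standard-library xml.etree.ElementTree in place of Beautiful Soup so nothing needs installing, and it uses the CSV file itself as the "database" of guids already seen. It assumes a plain RSS 2.0 feed with no XML namespaces on the item tags. The names parse_items, append_new, and update_csv, and the feed URL in the usage note, are made up for illustration.

```python
import csv
import html
import os
import re
import urllib.request
import xml.etree.ElementTree as ET


def plain_text(description, limit=50):
    """Drop tags (links, images) and entities, then keep the first `limit` characters."""
    text = re.sub(r"<[^>]+>", " ", description or "")  # crude tag removal, not a full HTML parser
    text = html.unescape(text)                         # turn &amp; and friends back into characters
    text = " ".join(text.split())                      # collapse runs of whitespace
    return text[:limit]


def parse_items(xml_text):
    """Return (title, guid, short description) tuples, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        guid = (item.findtext("guid") or "").strip()   # the permalink, not <link>
        rows.append((title, guid, plain_text(item.findtext("description"))))
    return rows


def seen_guids(csv_path):
    """Collect the guids (second column) already stored in the CSV file."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[1] for row in csv.reader(f) if len(row) > 1}


def append_new(rows, csv_path):
    """Append only rows whose guid is not in the file yet; return how many were added."""
    seen = seen_guids(csv_path)
    fresh = []
    for row in rows:
        if row[1] not in seen:
            seen.add(row[1])
            fresh.append(row)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(fresh)
    return len(fresh)


def update_csv(feed_url, csv_path):
    """Download the feed and add any items not seen before to the CSV file."""
    with urllib.request.urlopen(feed_url) as resp:
        xml_text = resp.read()
    return append_new(parse_items(xml_text), csv_path)
```

Running something like update_csv("http://example.com/feed.xml", "feed.csv") once a day (both arguments are placeholders) should only append items whose guid is not already in the file, and Excel can open the resulting CSV directly.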
If even that sounds like too much work, consider a quick-and-dirty approach: subscribe to the RSS feed in Liferea, then write a small script that exports a CSV file from Liferea's SQLite database. I used a similar process to export my blog from LiveJournal to Markdown; it's a really short script if you exclude the HTML-to-Markdown conversion.
posted by pwnguin at 10:48 AM on April 4, 2012