Help me populate a spreadsheet by scraping an RSS feed.
April 4, 2012 10:21 AM

I would like to scrape information from an RSS feed into an Excel-readable text file, for a completely legal, non-copyright-violating use. In a better world, I'd have access to the database that generates the feed, but since this ain't a perfect world it appears that scraping is my best bet. Are there tools that will help me automate this, or programming tutorials that will help me figure it out myself (it's been 15 years since I last wrote any code beyond simple SQL queries)?

The XML is formatted thus, for each new post (but with angle brackets where I've put square brackets):

[item]
[title]A title[/title]
[link]http://URL[/link]
[guid isPermalink="true"]http://URL[/guid]
[description]Description, which may include embedded links and images.
[/description]
[/item]

I'd like to scrape this into an Excel-readable format, where each row consists of:
TITLE, URL (from the "guid" permalink, not from "link"), DESCRIPTION (first 50 characters; don't need links or images).
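So the sample item above would come out as a row along the lines of:
"A title","http://URL","Description, which may include embedded links and"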

In an even more ideal world, I'd be able to do this in a smart enough manner that if I scrape the feed every day my software/widget/whatever tool can distinguish new content and only scrape that.

I know this is possible, that it would be super easy for the right programmer, and that without any help I could probably cobble something together in a month or two. But I'm a writer without access to the "right programmer," and I'd really prefer not to take 1-2 months trying to figure it out.
posted by croutonsupafreak to Computers & Internet (9 answers total) 5 users marked this as a favorite
 
Sounds pretty simple, if you're willing to skip Excel. Set up a database for this and it becomes a pretty simple Python script (rough sketch after the steps):

1. Load up beautiful soup
2. download the rss file
3. parse the rss file with beautiful soup
4. iterate over the items
5. for each item, check to see if the guid is already in the database and discard it if it is
6. write the remaining items to the database, truncating the description text.
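
A minimal sketch of those steps, assuming Python 3, BeautifulSoup 4 with lxml installed, and placeholder URL/filenames:

#!/usr/bin/env python
import sqlite3
import urllib.request
from bs4 import BeautifulSoup

FEED_URL = 'http://example.com/feed.rss'  # placeholder

# 1-2. load Beautiful Soup and download the feed
xml = urllib.request.urlopen(FEED_URL).read()

# 3. parse the RSS (the 'xml' parser requires lxml)
soup = BeautifulSoup(xml, 'xml')

con = sqlite3.connect('feed_items.db')
con.execute("CREATE TABLE IF NOT EXISTS items "
            "(guid TEXT PRIMARY KEY, title TEXT, description TEXT)")

# 4-6. iterate over the items, skip guids already in the database,
#      and store the rest with a truncated, tag-free description
for item in soup.find_all('item'):
    guid = item.find('guid').get_text()
    if con.execute("SELECT 1 FROM items WHERE guid = ?", (guid,)).fetchone():
        continue  # already scraped on an earlier run
    title = item.find('title').get_text()
    desc = BeautifulSoup(item.find('description').get_text(),
                         'html.parser').get_text()[:50]
    con.execute("INSERT INTO items VALUES (?, ?, ?)", (guid, title, desc))

con.commit()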

If even that sounds like too much work, a quick-and-dirty alternative could also do the job: pull the RSS into Liferea and write a small script that dumps a CSV file from Liferea's SQLite database. I used a similar process to export my blog from LiveJournal to Markdown; it's a really short script if you exclude the HTML-to-Markdown conversion.
posted by pwnguin at 10:48 AM on April 4, 2012


I'd do it in Perl. I'm assuming that you can get to a Perl environment, either via the command line on Linux or OS X, or via Cygwin or the ActiveState Perl port on Windows.

Here's a super quick stupid hack that probably misses your guid vs link distinction (I'm using Perl's XML::RSS module to output stuff, and haven't looked at the output other than to confirm it works in my RSS reader...). This also doesn't try to do anything about caching files locally, or only outputting changed records, or what-have-you.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use XML::RSS;

# fetch the feed
my $content = get('http://www.flutterby.net/User:DanLyke_status.rss')
    || die "Unable to get RSS\n";

my $rss = XML::RSS->new();
$rss->parse($content);

# print each item as a crudely escaped, single-quoted CSV-ish line
foreach my $item (@{$rss->{'items'}})
{
    print join(',', map {$a = $item->{$_}; $a =~ s/\\/\\\\/g; $a =~ s/\'/\\'/g; "'$a'"}
               ('title', 'link', 'description'));
    print "\n";
}
Known flaws:
  • For caching files, if you do a perldoc LWP::Simple you'll see a "mirror" function you can use instead of "get" to save bandwidth, and you can then use $rss->parsefile.
  • There's a better way to output CSV versions of $item->{title} etc.
  • You need to figure out how to handle duplicates.
This help?
posted by straw at 10:52 AM on April 4, 2012


I cobbled this together; it's not awesome and it has an external dependency (feedparser), but it kinda works. At least the fields I used worked with a sample feed.

#!/usr/bin/env python
# Python 2; needs the feedparser library.

import feedparser, sqlite3

con = sqlite3.connect('your_storage.db')
cur = con.cursor()

# keep the table between runs so the primary key can reject already-seen entries
cur.execute("""
    CREATE TABLE IF NOT EXISTS entries(
        url PRIMARY KEY,
        title,
        description)""")

rss = feedparser.parse('http://twitter.com/statuses/user_timeline/DBAHULK.rss')
for entry in rss.entries:
    try:
        cur.execute("""
            INSERT INTO entries (url,title,description)
            VALUES (?,?,?)""",
            (entry['id'], entry['title'], entry['summary'][:50]))
    except Exception as e:
        # duplicate guids violate the primary key and are simply skipped
        print "error (%s) inserting '%s' -- %s" % (type(e), entry['id'], e.args[0])

con.commit()

# rewrite the CSV from everything collected so far
F = open('your_csv_output.csv', 'w')
cur.execute("SELECT * FROM entries")
for entry in cur.fetchall():
    row = """"%s","%s","%s"\n""" % (entry[0], entry[1], entry[2])
    F.write(row.encode('utf8'))
F.close()
posted by mce at 11:04 AM on April 4, 2012


Best answer: Is Excel XML mapping no good?
posted by urbanwhaleshark at 11:06 AM on April 4, 2012 [1 favorite]


Best answer: Ugh, a few notes on that script:

1) Watch the indentation if you copy it; Python won't run it if the whitespace gets stripped or mangled.
2) feedparser is a very robust library for parsing RSS and Atom feeds. In smoothing over the differences between feed specs, some fields may not end up with the names you expect (the guid shows up as entry['id'], for example).
3) Duplicates are handled by letting the SQL library enforce the uniqueness of the guid (the url primary key) and swallowing the error that pops up when you try to re-add an existing entry; a quieter alternative is sketched below.
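
For what it's worth, SQLite's INSERT OR IGNORE gets the same effect without raising at all; the INSERT in the script would become roughly:

cur.execute("""
    INSERT OR IGNORE INTO entries (url,title,description)
    VALUES (?,?,?)""",
    (entry['id'], entry['title'], entry['summary'][:50]))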
posted by mce at 11:08 AM on April 4, 2012


Response by poster: You folks are THE BOMB(s). It looks like Excel XML mapping does exactly what I'm looking for, minus the bonus "avoid stuff already scraped," and I'm gonna spend a bit of time familiarizing myself with feedparser in the meantime. Gracias!
posted by croutonsupafreak at 11:17 AM on April 4, 2012


Best answer: Another possibility: Google Docs spreadsheets have an ImportFeed function.
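A minimal sketch of the formula, with a placeholder feed URL (the optional arguments are a query, a headers flag, and how many items to pull):

=ImportFeed("http://example.com/feed.rss", "items", TRUE, 20)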
posted by mce at 11:39 AM on April 4, 2012


A quick hack using the Mojolicious Perl framework is this Gist. Mojo::UserAgent has a nice could-be-one-line flow to it, and Mojo::DOM gives a pretty nice HTML5/XML DOM with CSS-like selection ( find('item guid[isPermalink="true"]')->each(...) ). If you were to go the Perl route...

...
my $csv = Text::CSV_XS->new;                                                    

sub do_url {
    my $URL = 'http://www.flutterby.net/User:DanLyke_status.rss';
    my $res = Mojo::UserAgent->new->get($URL)->res;
    my $xml = Mojo::DOM->new->xml(1)->parse( $res->body );

    $xml->find('item link')->each(
        sub {
            my ($i) = @_;
            my $p = $i->parent;
            $csv->combine( map { $p->$_->text } qw(title link description) );
            say $csv->string;
        }
    );
}
...

posted by zengargoyle at 12:56 PM on April 4, 2012


You could use the PowerPivot add-in for Excel and its built-in import-from-RSS functionality. It has a GUI and can bring in the stuff you want in no time. The imported results can then be analyzed in a pivot table. To get more data, click Refresh.

Although PowerPivot can easily do a full data refresh, it is not smart enough to do an incremental refresh from RSS feeds. For that, you need the full version of Analysis Services in SQL Server 2012.
posted by crazycanuck at 2:15 PM on April 4, 2012

