Server-side script for combining RSS feeds?
January 2, 2009 10:43 AM

Do you know of a server-side script for combining RSS feeds, removing duplicates, filtering out items that match keywords, and then generating a new RSS feed as a result?

I'm looking for something that can be run on a standard web server with a typical LAMP configuration. I have 189 feeds, 3500+ feed items, and 500+ keyword filters I want to combine.

I'd settle for a combination of services that each reliably handle one of the tasks: combining feeds, filtering out duplicates, acting on keyword matches, and generating a clean, unified feed.

This feed will be loaded into a shared Google Reader account, where various people will look at the combined feed items and star some for later attention, so I'm not looking for a client-side single-user feed reader.

Newsgator and Bloglines do not allow the marking and sorting of posts that I want, nor are they particularly efficient at going through hundreds of posts in a single sitting. Google Reader's single-letter keyboard commands are hard to beat.

Google Reader, however, does not do filtering (its biggest drawback out of the many drawbacks it has). There is a Greasemonkey script for doing filtering, but the people who will be viewing this combined feed either do not use Firefox or do not have the technical aptitude to use and update Greasemonkey. Further, I add and revise the keyword filters dozens of times a day, so a client-side filter doesn't really work. It needs to be server side so my changes are reflected wherever the work is being done.

Yahoo Pipes chokes and fails on the feeds. It only sporadically pushes out XML, it only pushes out a small bit of it, and it does it infrequently and after much delay.

FeedRinse keeps failing to add all of my feeds, inexplicably not saving them when I add them to a channel. The ones that are added to a channel are not pushing out any aggregated XML. The individually rinsed channels do load, but I don't want to individually add all those keyword filters to all those feeds, which is the point of channels. I would spend days just entering in the keyword filters.

MySyndicaat does not seem to be permitting new registrations (hitting the submit button throws up a very stupid pop-up that tells you to go to another site, where they lead you right back to the same registration that doesn't work). RSSMix, which I might be able to use to at least combine the feeds, does accept and read the feed but times out when I try to read the aggregated feed it produces.

MyFeedz is shut down.

Google Reader's sharing will only show 20 items of a shared folder containing all the feeds; I need it to show ALL of those items in the shared folder. Otherwise, it might serve as a decent feed aggregator.
posted by Mo Nickels to Computers & Internet (13 answers total) 2 users marked this as a favorite
 
Have you tried SimplePie? I am not sure about "keyword support", but it certainly makes it easy to merge a series of feeds. (merge feeds tutorial)
posted by shownomercy at 11:20 AM on January 2, 2009


Response by poster: Ooh, that looks good. It didn't come up at all on a Google search, which I hope says more about the spammy sites full of dreck called "scripts" than it does about my Google-fu.
posted by Mo Nickels at 11:55 AM on January 2, 2009


What about building your own Yahoo Pipe? It's pretty easy to set up and there are plenty of pipes you can clone to see how they work. On a whim I built one that takes the Fat Wallet RSS feed and only shows me postings that have a rating higher than 5. It weeds out all the garbage.
posted by bleucube at 12:31 PM on January 2, 2009


Response by poster: Unfortunately, it looks like SimplePie doesn't regenerate a single feed from the feeds once it collates them, and the one SimplePie-based script I found that claims to do this generates faulty XML.

I'll continue to work with it, but I'm still looking for suggestions.
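
For what it's worth, faulty XML in a regenerated feed often comes down to unescaped characters in titles or descriptions. Here is a minimal sketch of writing a single RSS 2.0 feed from already-collated items, in Python rather than PHP, using only the standard library. The `build_rss` name and the shape of the item dicts are assumptions for illustration, not part of SimplePie.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Serialize items (dicts with title/link/description/pubDate) as RSS 2.0 text."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires these three channel elements.
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description", "pubDate"):
            if item.get(tag):
                ET.SubElement(node, tag).text = item[tag]
    # ElementTree escapes &, <, and > in element text when serializing.
    return ET.tostring(rss, encoding="unicode")
```

Letting an XML library do the serializing, instead of concatenating strings, is the design choice that matters here; the rest is bookkeeping.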
posted by Mo Nickels at 12:36 PM on January 2, 2009


I use ReBlog for something like this. It's not in active development, but it works. You might run into a problem with the rss output. The solution is found here.
posted by elle.jeezy at 12:45 PM on January 2, 2009


There's also CaRP which I use as well.
posted by elle.jeezy at 12:48 PM on January 2, 2009


Response by poster: Bleucube, read my original question. Yahoo Pipes is a big fat fail with this.
posted by Mo Nickels at 12:59 PM on January 2, 2009


Response by poster: That web site CaRP looks like a huge turd. I don't think I'd risk giving my money to someone who looks like a huckster.
posted by Mo Nickels at 1:36 PM on January 2, 2009


Yeah, well, there is that, but I've found the script to be quite useful, and highly configurable. I also got it before he increased all the marketing. Finally, there's a free version buried in there somewhere. I ignore the huckstering. :)
posted by elle.jeezy at 2:13 PM on January 2, 2009


Here's a go at the problem. It does combining and filtering but not deduplication. Right now each time it runs it checks each item in each feed and overwrites the output. This would be easy to change if you wanted to run the script at a given interval.
#!/usr/bin/perl

use strict;
use warnings;
use XML::RSS;
use LWP::Simple;

use vars qw{$feeds_path $keywords_path $output_path};

# one url per line
$feeds_path = 'feeds.txt';

# one lowercase search term per line
$keywords_path = 'keywords.txt';

$output_path = 'nickels.xml';

sub main {

  my $rss_reader = new XML::RSS;
  my $rss_writer = new XML::RSS( version => '2.0' );

  my @keywords;
  open my $kp, '<', $keywords_path or die "could not open $keywords_path: $!";
  while (<$kp>) {
    chomp $_;
    push @keywords, $_;
  }
  close $kp;

  open my $fp, '<', $feeds_path or die "could not open $feeds_path: $!";
  while (<$fp>) {

    chomp $_;
    my $feed_url = $_;

    my $feed_string = get($feed_url);
    unless ( defined $feed_string ) {
      warn "skipping feed because could not fetch $feed_url\n";
      next;
    }

    $rss_reader->parse($feed_string);

    foreach my $item ( @{ $rss_reader->{'items'} } ) {

      foreach my $keyword (@keywords) {

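        # if keyword is in this post add it to output; \Q...\E matches the
        # keyword literally, and `last` stops at the first hit so an item
        # matching several keywords is only added once
        if ( lc( $item->{'title'} ) =~ /\Q$keyword\E/ ) {
          push @{ $rss_writer->{'items'} }, $item;
          last;
        }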

      }

    }

  }

  close $fp;

  # save feed of matching items
  $rss_writer->save($output_path);

  return;

}

main();
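
The deduplication that the script above leaves out could be layered on by remembering a key for every item already kept. A minimal sketch of the idea, in Python rather than Perl, assuming each item is a dict that may carry `link`, `guid`, and `title` (the dict shape and the function names are hypothetical, not part of XML::RSS):

```python
def item_key(item):
    """Pick the first non-empty identifying field: link, then guid, then title."""
    for field in ("link", "guid", "title"):
        value = (item.get(field) or "").strip()
        if value:
            return (field, value)
    return None  # nothing to identify the item by


def dedupe_items(items):
    """Return items in their original order, keeping the first item for each key."""
    seen = set()
    unique = []
    for item in items:
        key = item_key(item)
        # Items with no usable key are kept, since they cannot be compared.
        if key is None or key not in seen:
            unique.append(item)
        if key is not None:
            seen.add(key)
    return unique
```

Keying on the link first is a judgment call: the same story syndicated through two feeds usually shares a link but may carry different guids.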

posted by PueExMachina at 6:26 PM on January 2, 2009


Response by poster: Thanks, PueExMachina, for the script. I may have a go at customizing it for the task because the keywords are all to *eliminate* certain items rather than to exclude them.
posted by Mo Nickels at 7:38 PM on January 2, 2009


Response by poster: Err, rather than INCLUDE them.
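
Flipping the match around, so that a keyword hit drops an item instead of keeping it, might look like this sketch. It is Python rather than Perl, it matches case-insensitive substrings in the title only (as the script does), and the function names and item dict shape are assumptions for illustration.

```python
def load_keywords(lines):
    """Lowercase and trim keywords, dropping blank lines (a blank would match everything)."""
    return [line.strip().lower() for line in lines if line.strip()]


def is_blocked(item, keywords, fields=("title",)):
    """True if any keyword occurs in any of the chosen fields of the item."""
    text = " ".join((item.get(field) or "") for field in fields).lower()
    return any(keyword in text for keyword in keywords)


def exclude_matching(items, keywords):
    """Keep only the items that match none of the keywords."""
    return [item for item in items if not is_blocked(item, keywords)]
```

Because the keyword list is read on every run, edits to the keyword file would take effect the next time the feed is regenerated.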
posted by Mo Nickels at 8:16 PM on January 2, 2009


Glad I could help. Since you're searching for multiple patterns, it may be faster to use an algorithm like Rabin-Karp (there's an implementation on CPAN).
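
To show the idea, here is a rough sketch of multi-pattern Rabin-Karp in Python rather than Perl. It is not the CPAN module's interface. Keywords are grouped by length, a rolling hash slides over the text once per length, and every hash hit is confirmed by comparing the actual strings. The function names, base, and modulus are arbitrary choices for illustration.

```python
def poly_hash(s, base, mod):
    """Polynomial hash of s: sum of ord(s[i]) * base**(len(s)-1-i), modulo mod."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def find_keywords(text, keywords, base=257, mod=(1 << 61) - 1):
    """Return the set of keywords that occur somewhere in text."""
    by_length = {}
    for keyword in keywords:
        if keyword:  # skip empty keywords
            by_length.setdefault(len(keyword), set()).add(keyword)

    found = set()
    for m, group in by_length.items():
        if m > len(text):
            continue
        targets = {}
        for keyword in group:
            targets.setdefault(poly_hash(keyword, base, mod), set()).add(keyword)
        high = pow(base, m - 1, mod)  # weight of the character leaving the window
        h = poly_hash(text[:m], base, mod)
        for start in range(len(text) - m + 1):
            if h in targets:
                window = text[start:start + m]
                if window in targets[h]:  # confirm, since hashes can collide
                    found.add(window)
            if start + m < len(text):
                # Roll the window: drop text[start], append text[start + m].
                h = ((h - ord(text[start]) * high) * base + ord(text[start + m])) % mod
    return found
```

With a few hundred keywords and a few thousand titles, a plain substring loop may well be fast enough, so it is probably worth timing the simple version before reaching for this.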
posted by PueExMachina at 3:42 PM on January 4, 2009

