How to extract a URL from an RSS feed with regex in Yahoo Pipes?
April 23, 2010 6:22 PM

Regexfilter: I need help filtering an RSS feed with regular expressions in Yahoo Pipes. Specifics inside.

In Yahoo Pipes, I’m using the Regex module on an RSS feed to extract a URL from the item.description field.

The added difficulty is that I want the regex to match only one specific domain (for the purposes of this question: EXAMPLE.COM) and replace the feed item with just that URL.

In layman's terms I want to do this to the page:
[trim everything before this]EXAMPLE.COM/MORE/MORE[trim everything after this]

The following worked for stripping out a URL:

in item.description replace
(?i).*?href="([^"]*).*
with $1

checkboxes:
[ ] g [x] s [ ] m [ ] i

But I don't understand regex well enough to make it work for one specific domain. Thanks in advance for any help you can provide.
posted by sharkfu to Computers & Internet (4 answers total)
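
For readers who want to try the replacement from the question outside Pipes, here is a minimal Python sketch of the same pattern. The sample description is invented for illustration. The sketch assumes the checked "s" box corresponds to Python's re.DOTALL flag and that $1 corresponds to \1; the Pipes regex engine may differ in details.

```python
import re

# Hypothetical item.description content, made up for this sketch.
description = (
    '<p>Some intro text</p>\n'
    '<a href="http://www.example.com/more/more">link</a>\n'
    '<p>trailing text</p>'
)

# (?i) makes the match case-insensitive. re.DOTALL plays the role of the
# checked "s" box: "." also matches newlines, so the pattern can consume
# the whole multi-line description.
pattern = re.compile(r'(?i).*?href="([^"]*).*', re.DOTALL)

# Python writes the first capture group as \1 where Pipes uses $1.
url = pattern.sub(r'\1', description, count=1)
print(url)
```

Because the pattern consumes everything before and after the first href value, the whole description collapses to that one URL, whatever its domain.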
 
Best answer: I'm not super familiar with Pipes regex, but try:
(?i).*?href="(http://www\.example\.com[^"]*).*

And you probably want to check that "i" box so it's not case-sensitive.
posted by meta_eli at 6:53 PM on April 23, 2010
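
A rough Python sketch of how the domain-restricted pattern above behaves, assuming Pipes treats it much like Python's re module does. The sample description is hypothetical, and re.DOTALL stands in for the "s" checkbox.

```python
import re

# Hypothetical description with two links; only the second is on the
# wanted domain.
description = (
    '<a href="http://other.example.org/page">first</a>\n'
    '<a href="http://www.example.com/more/more">second</a>'
)

# Same pattern as in the answer above: the capture group must start with
# the wanted domain, so the lazy .*? keeps scanning past other hrefs.
pattern = re.compile(
    r'(?i).*?href="(http://www\.example\.com[^"]*).*', re.DOTALL
)

url = pattern.sub(r'\1', description, count=1)
print(url)

# When nothing matches, sub() returns the input unchanged.
unchanged = pattern.sub(r'\1', 'no links here', count=1)
print(unchanged)
```

In this sketch an item with no matching link keeps its original description, so such items would still need to be filtered out separately if they are unwanted.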


Best answer: Yeah, I'd do something simple like this (using | as the delimiter to avoid escaping the forward slashes):

|href="http://(example\.com/[\w.]*/[\w.]*/)|

(do you want the 'http://' or no?)

and then refer to the backreference, but I dunno if you can do that in Yahoo Pipes.

I've added a last / at the end, 'cause I don't know how you would end your expression. Is the last "more" always consistent? Or will the "mores" change? Will there always be two forward slashes? My regexp may be too specific, and you could just use something like meta_eli's which will capture more but is less complex. Questions, questions...what you're trying to do shouldn't be hard but I'm not sure on the exact specifications.
posted by dubitable at 8:15 PM on April 23, 2010
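
A small Python sketch of the capture-group idea from the answer above. As adjustments made for this sketch, the pattern includes the quote after href= and escapes the dot in the domain; the sample strings are hypothetical.

```python
import re

# Two path segments, each followed by a slash, after the domain.
pattern = re.compile(
    r'href="http://(example\.com/[\w.]*/[\w.]*/)', re.IGNORECASE
)

def capture(description):
    """Return the first capture group, or None when nothing matches."""
    match = pattern.search(description)
    return match.group(1) if match else None

with_slash = capture('<a href="http://example.com/more/more/">link</a>')
without_slash = capture('<a href="http://example.com/more/more">link</a>')

print(with_slash)
print(without_slash)
```

The second call finds nothing because the pattern insists on a trailing slash, which illustrates the concern about the expression being too specific.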


Regexes can only reliably deal with regular languages, and RSS/XML/HTML are not regular languages. Regular expressions will therefore be brittle when applied to them: they suffice only as long as the input doesn't use all the variations the format permits. For example, a simple regex won't know to skip <!-- some EXAMPLE.COM link that should really be ignored -->.

I don't say this to avoid answering your regex question, but to warn readers that this isn't a robust approach to parsing RSS/XML/HTML.

Instead I suggest using XPath or XSLT.
posted by holloway at 6:03 AM on April 24, 2010
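
As an illustration of the parser-based route, here is a hedged Python sketch. It does not use full XPath or XSLT. Instead it uses the standard library's ElementTree (which supports a small XPath subset) to find the items and html.parser to read the links in each description. The feed content, the LinkCollector class, and the example_links function are all invented for this sketch.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical feed: the description holds escaped HTML, including a
# commented-out link that should be ignored.
RSS = """<rss version="2.0"><channel><title>Demo feed</title>
<item><title>One</title>
<description>&lt;!-- &lt;a href="http://www.example.com/ignored"&gt;old&lt;/a&gt; --&gt;
&lt;a href="http://www.example.com/more/more"&gt;real&lt;/a&gt;
&lt;a href="http://other.example.org/page"&gt;elsewhere&lt;/a&gt;</description>
</item></channel></rss>"""


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags; HTML comments are skipped."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def example_links(rss_text, hosts=("example.com", "www.example.com")):
    """Return links on the given hosts found in each item's description."""
    root = ET.fromstring(rss_text)
    found = []
    for item in root.findall("./channel/item"):
        collector = LinkCollector()
        collector.feed(item.findtext("description") or "")
        collector.close()
        found.extend(
            link for link in collector.links
            if urlparse(link).hostname in hosts
        )
    return found


links = example_links(RSS)
print(links)
```

The commented-out link never reaches handle_starttag, so it is left out without any special handling in the matching logic.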


You don't have to do everything in one regex. You might be better off putting in an additional filter step beforehand - drag in a Filter to select just those items that contain example.com before sending them to the regex.
posted by Electric Dragon at 9:36 AM on April 24, 2010
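
The two-step idea above can be sketched in Python as a filter followed by the generic replacement from the question. The item dictionaries are hypothetical stand-ins for feed items, and the substring test is only an approximation of what a Pipes Filter module would do.

```python
import re

# Hypothetical feed items.
items = [
    {"title": "A", "description": '<a href="http://www.example.com/more/more">x</a>'},
    {"title": "B", "description": '<a href="http://other.example.org/page">y</a>'},
    {"title": "C", "description": "no links at all"},
]

# Step 1: keep only items whose description mentions the domain.
kept = [item for item in items if "example.com" in item["description"].lower()]

# Step 2: apply the generic href-extracting replacement to what is left.
pattern = re.compile(r'(?i).*?href="([^"]*).*', re.DOTALL)
result = [
    {**item, "description": pattern.sub(r"\1", item["description"], count=1)}
    for item in kept
]

print(result)
```

One caveat with this simple version: an item that mentions the domain but whose first link points elsewhere would pass the filter and yield the wrong URL, so the domain-specific pattern may still be worth using in the second step.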

