How can I get my two regex subexpressions to match repeatedly?
July 12, 2009 10:42 PM

Regular Expressions: How can I get my two subexpressions to match repeatedly? Right now my pattern matches the first instance but refuses to match anything after it.

I've spent a few hours looking at regex tutorials, examples, and fiddling with a regex tester program and I'm obviously not making the intuitive leap.

Here's my pattern (PHP) and my sample text.

I'm trying to pick the four sets of HREF locations and link text (the subexpressions) out of the HTML soup. So far $matches contains the whole subsection, and only the first set of location/text that I want. I suspect my failure is that some portion of the expression isn't greedy. Right?

Please hope me!
posted by cowbellemoo to Computers & Internet (6 answers total)
 
Best answer: No, your regular expression is fine -- it's just that it starts and ends with "BEGIN TODAYS NEWS CONTENT" and "END TODAYS NEWS CONTENT", both of which only appear once in the input text, so the regexp as a whole is only going to match once, right? You can probably mess around with look-behind and look-ahead, but the easiest way to get this to work is to write two regexps, one to extract the portion of text you're interested in, and the other to match against it:

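// First pass: isolate the text between the two marker strings.
$outer_pattern = '#BEGIN TODAYS NEWS CONTENT(.*?)END TODAYS NEWS CONTENT#s';
preg_match_all($outer_pattern, $input, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER, $matches[0] is the first match and [1] its first group.
$link_section = $matches[0][1];

// Second pass: match every href / bold link text pair inside that section.
$inner_pattern = '#<a href="(.*?)".*?<b>(.*?)</b>#s';

preg_match_all($inner_pattern, $link_section, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo "extracted: " . $m[1] . " : " . $m[2] . "\n";
}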
posted by inkyz at 11:08 PM on July 12, 2009


Best answer: Parsing HTML with regular expressions is fiddly and error-prone, and that's putting it mildly. Why not use a prebuilt parser?
posted by flabdablet at 1:40 AM on July 13, 2009 [2 favorites]


Best answer: Like inkyz said, you will need to perform two operations. You can craft a regular expression to match the middle groups multiple times (to wit: (?:<a href="(.*?)".*?<b>(.*?)</b>.*?)+), but each repetition overwrites what the capture groups held from the previous one, so you end up with only the last set.
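
A minimal sketch of that overwriting behaviour, written in Python rather than PHP. The sample markup and the variable names are made up for illustration; only the repeated group and the two marker strings come from this thread.

```python
import re

# Hypothetical sample text standing in for the poster's page source.
section = (
    'BEGIN TODAYS NEWS CONTENT '
    '<a href="/one"><b>First</b></a> '
    '<a href="/two"><b>Second</b></a> '
    '<a href="/three"><b>Third</b></a> '
    'END TODAYS NEWS CONTENT'
)

pair = r'<a href="(.*?)".*?<b>(.*?)</b>'

# One overall match with the pair repeated inside a non-capturing group.
# Every repetition reuses groups 1 and 2, so only the final pair survives.
repeated = re.search(
    r'BEGIN TODAYS NEWS CONTENT.*?(?:' + pair + r'.*?)+END TODAYS NEWS CONTENT',
    section,
    re.S,
)
last_only = repeated.groups()

# Applying the single-pair pattern repeatedly keeps every pair instead.
all_pairs = re.findall(pair, section, re.S)

print(last_only)
print(all_pairs)
```

Here `last_only` holds just the third link, while `all_pairs` holds all three, which is why the two-step approach is needed.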

Also, using regular expressions to scrape HTML is insanity. Find a DOM parser for your platform that lets you do this programmatically. I don't know PHP well, but an example of what this operation would look like in Python using BeautifulSoup:


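    from bs4 import BeautifulSoup  # in 2009 (BeautifulSoup 3): from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')  # html: the page source as a string (assumed)
    # Find the marker text, then the first table that follows it.
    start = soup.body.find(text=' BEGIN TODAYS NEWS CONTENT ')
    table = start.findNext('table')
    for anchor in table('a'):  # calling a tag is shorthand for find_all
        href = anchor['href']
        title = anchor.find('b').renderContents()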


I'm sure PHP has something similar. While there is a slight performance hit parsing a DOM vs. pre-compiling a custom regular expression, the former is more readable, maintainable, and less error-prone when small changes occur in your source material.
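
To illustrate the point about small changes in the source material, here is a rough sketch using only Python's standard library. Note that `html.parser` is an event-based parser rather than a DOM, so this swaps in a different technique from the BeautifulSoup example. The `LinkCollector` class, the `extract_links` helper, and the sample markup are all hypothetical.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, bold text) pairs from anchors. Illustrative only."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None      # href of the anchor currently open, if any
        self._in_bold = False  # True while inside a <b> within that anchor
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []
        elif tag == 'b' and self._href is not None:
            self._in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self._in_bold = False
        elif tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None

    def handle_data(self, data):
        if self._in_bold:
            self._text.append(data)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical change: the second anchor gained a class attribute before href.
changed = (
    '<a href="/one"><b>First</b></a>'
    '<a class="new" href="/two"><b>Second</b></a>'
)

regex_links = re.findall(r'<a href="(.*?)".*?<b>(.*?)</b>', changed, re.S)
parser_links = extract_links(changed)

print(regex_links)
print(parser_links)
```

The regex silently drops the second link because it expects `href` to follow `<a` directly, while the parser reads attributes by name and still finds both.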
posted by cj_ at 4:28 AM on July 13, 2009


Response by poster: Oh, splendid. I went with the DOM parser approach since I want clean code. You should have seen the shameful series of regex patterns in the original script... Anyway, here's my code after a quick study of flabdablet's linked parser. Lovely and versatile. Thanks everyone!
posted by cowbellemoo at 8:05 AM on July 13, 2009


Glad you got a result.

Regular expressions are neat and all, but they truly are the closest thing that the world of software design has to the hammer that makes everything look like a nail. If you ever find yourself doing anything truly complicated with REs, there's usually a cleaner way available - but you have to step back from the problem to find it, and REs are such promising things that they often make that emotionally difficult to do :-)
posted by flabdablet at 4:12 PM on July 13, 2009 [1 favorite]


Glad that worked out for you, and that I could at least contribute to talking someone out of the regex rabbit hole. They are very powerful but not suited for every problem. This article sums up my feelings about PCRE pretty well.
posted by cj_ at 6:22 PM on July 14, 2009

