Regular expression puzzle time!
May 4, 2009 1:14 PM

You have a chunk of HTML. You want the content of all of the table cells inside the table rows where the row has exactly nine cells. You do not want cells from rows with more or fewer cells. Example content:

<tr class="evenrow">
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
</tr>

Example regex pattern that DOES NOT WORK:

$pattern = "/<tr class=\"\w{3,4}row\">.*?(<td.*?>.+?<\/td>){9}.*?<\/tr>/s";

Is there a way I can do this with one regular expression? Thanks!
posted by delladlux to Technology (9 answers total)
 
Your capturing group doesn't account for the newline between cells. Put a \s+ or whatever in there and it should work.
posted by wongcorgi at 1:29 PM on May 4, 2009


Response by poster: wongcorgi: Where?

$pattern = "/<tr class=\"\w{3,4}row\">.*?(<td.*?>.+?<\/td>){9}.*?\s+?<\/tr>/s";

Still yields no results. Doing:

$pattern = "/<tr class=\"\w{3,4}row\">\s+(<td.*?>.+?<\/td>\s+){9}.*?\s+?<\/tr>/s";

Will give me the last (the ninth) cell, but I need all nine.
posted by delladlux at 1:37 PM on May 4, 2009


Also, if you want to get the data in one regex, you'll need to expand the {9} out into individual sets of <td>(.*?)</td>
posted by wongcorgi at 1:37 PM on May 4, 2009


Response by poster: I guess that's it. I always thought the {n} construct let you say "give me the last n strings/groupings that matched", rather than just the last one in the set.
posted by delladlux at 1:41 PM on May 4, 2009
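
For anyone reading along, the behaviour discussed above can be checked with a small sketch. It is written in Python rather than the poster's PHP, and the names (row, cell, repeated, expanded) are made up for illustration. A capturing group under {9} is overwritten on every iteration, so only the ninth cell survives; writing the group out nine times keeps them all.

```python
import re

# Sample row shaped like the one in the question (nine cells, one per line).
row = '<tr class="evenrow">\n' + "".join(f"<td>{n}</td>\n" for n in range(1, 10)) + "</tr>"

# One capturing group repeated nine times: each iteration overwrites the
# previous capture, so only the last cell is left in group 1.
repeated = re.compile(
    r'<tr class="\w{3,4}row">\s*(?:<td[^>]*>(.+?)</td>\s*){9}</tr>', re.S
)
last_only = repeated.search(row).groups()

# The same cell sub-pattern written out nine times: nine separate groups.
cell = r"<td[^>]*>(.+?)</td>\s*"
expanded = re.compile(r'<tr class="\w{3,4}row">\s*' + cell * 9 + r"</tr>", re.S)
all_nine = expanded.search(row).groups()

print(last_only)  # ('9',)
print(all_nine)   # ('1', '2', '3', '4', '5', '6', '7', '8', '9')
```

Building the pattern by string repetition is just a way to avoid typing the cell group nine times by hand; the compiled regex is the same as the fully expanded one.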


<tr([^>]*)>(\s?)(<td>([^<]*)</td>(\s?)){9}</tr>
posted by bricoleur at 1:43 PM on May 4, 2009


To get all nine, go from (pattern){9} to ((pattern){9}).

But your regex will match nine or more cells. The .+? in the cell data (and the .*? after it) can match across multiple open/close tags.

match 1 = <td>data</td>
[...]
match 9 = <td>data</td><td>data</td>

So you'd still have exactly nine matches of the pattern. You might be able to fix this with lookahead or lookbehind.
posted by lalas at 1:51 PM on May 4, 2009
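
A rough sketch of the over-matching described above, plus one possible lookahead fix. This is Python, not the poster's PHP, and make_row, loose and strict are hypothetical names. The lookahead version only lets the cell content advance while the next characters are not "</td>", so a single iteration cannot swallow two cells. It assumes cells never contain nested tables.

```python
import re

def make_row(n_cells):
    """Build a test row with n_cells cells, one per line (illustrative helper)."""
    cells = "".join(f"<td>{i}</td>\n" for i in range(1, n_cells + 1))
    return f'<tr class="evenrow">\n{cells}</tr>'

# Lazy wildcard: when the match would otherwise fail, .+? is allowed to grow
# across a </td><td> boundary, so a ten-cell row still matches.
loose = re.compile(
    r'<tr class="\w{3,4}row">\s*((?:<td[^>]*>.+?</td>\s*){9})</tr>', re.S
)

# Negative lookahead: the content stops at the first </td>, so each of the
# nine iterations consumes exactly one cell.
strict = re.compile(
    r'<tr class="\w{3,4}row">\s*((?:<td[^>]*>(?:(?!</td>).)*</td>\s*){9})</tr>', re.S
)

for n in (8, 9, 10):
    print(n, bool(loose.search(make_row(n))), bool(strict.search(make_row(n))))
```

With these sample rows, the loose pattern accepts both the nine-cell and the ten-cell row, while the strict one accepts only the nine-cell row.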


Something like bricoleur's pattern will work only if your table cells contain no HTML tags. If that's always the case, go for it.

For a more robust regexp, there are two things you should use:

1) Grouping without creating a backreference. This lets you group things within a group that *does* create a backreference:

((?:pattern){9})

This will match "pattern" nine times and return a single group with all nine matches in one string.

2) Atomic grouping. I learned about this just now (thanks!). Basically, as lalas says, you can run into problems where your wildcards match the </td><td> of a cell boundary, not knowing they are "supposed" to stop there. Atomic grouping lets you grab the first match you get, ending with the first </td> in match 9 instead of trying to continue to another from a later cell.

For example: /^(.*?a){2}$/ matches baba, like you'd expect, and also bababa. /^(?>.*?a){2}$/ only matches baba, not bababa, as it grabs the first "ba" the group matches and never backtracks to find "baba" as a match for ".*?a" later.

So try this (hoping I can get the characters correct):

/<tr class=\"\w{3,4}row\">\s*((?:(?><td.*?>.*?<\/td>)\s*){9})\s*<\/tr>/

Basically, the (?>......) will take the very first matches of the lazy quantifiers inside it.

I've also included some whitespace matching in there (\s*) to make it a bit more flexible in that sense.

Note that this will fail if you have nested tables and your table cells contain other TD tags.

Now, take a look at that pattern. Notice how hideous it is? Please don't use it. Please. If you or anyone else ever needs to change this code or even figure out what the heck it's doing, that future person will hate you. And I think it works, but it's so ugly that I can't really analyze it well enough to be sure.

So there's an answer to your question. My actual advice is to make an array of the cell contents for each table row (something using split() in Perl, for example) and skip the current row if the count isn't what you want. That code will be readable.
posted by whatnotever at 2:44 PM on May 4, 2009 [1 favorite]
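
A minimal sketch of the "collect the cells per row, then count" approach suggested above. It is in Python and uses re.findall instead of Perl's split(), so treat it as one possible reading of the advice, not the commenter's code. The function name nine_cell_rows and the sample markup are invented, and it assumes no nested tables.

```python
import re

def nine_cell_rows(html, wanted=9):
    """Return the cell contents of every row that has exactly `wanted` cells."""
    result = []
    # First pass: pull out the inside of each <tr>...</tr>.
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, re.S | re.I):
        # Second pass: pull out the inside of each <td>...</td> in that row.
        cells = re.findall(r"<td\b[^>]*>(.*?)</td>", row, re.S | re.I)
        if len(cells) == wanted:
            result.append(cells)
    return result

# Illustrative input: a three-cell row followed by a nine-cell row.
sample = (
    '<table><tr class="oddrow"><td>a</td><td>b</td><td>c</td></tr>'
    '<tr class="evenrow">'
    + "".join(f"<td>{n}</td>" for n in range(1, 10))
    + "</tr></table>"
)

print(nine_cell_rows(sample))
```

Two simple patterns and a length check are easier to read and change than one pattern that tries to count cells itself.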


Strictly speaking, Perl regular expressions (which are among the most expressive REs) cannot handle the general case of the problem that you specify. Briefly, REs are great at doing lexical parsing, but they can be shown to be incapable of syntactic analysis. You may be able to produce an RE that handles the HTML examples you have today, but tomorrow's HTML input may not be recognized by it.

You'd be better off using something like the HTML::Parser module. Unfortunately, using it isn't easy, but if you want a robust parser, you'll need to abandon the RE.
posted by fydfyd at 3:35 PM on May 4, 2009
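
To give a feel for the parser route, here is a sketch using Python's standard html.parser module. It is a swapped-in stand-in for Perl's HTML::Parser, not that module's API, and the RowCollector class and sample markup are made up for illustration. It keeps only the text of each cell and assumes tables are not nested.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> that contains it."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently open, if any
        self._cell = None   # text fragments of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard a cell left unclosed at the end of a row

# Illustrative input: a two-cell row (with inline markup) and a nine-cell row.
sample = (
    '<table><tr class="oddrow"><td><b>x</b></td><td>y</td></tr>'
    '<tr class="evenrow">'
    + "".join(f"<td>{n}</td>" for n in range(1, 10))
    + "</tr></table>"
)

parser = RowCollector()
parser.feed(sample)
parser.close()
nine = [row for row in parser.rows if len(row) == 9]
print(nine)
```

Unlike the regex versions, this does not care about attribute order, whitespace, or markup inside the cells; the filtering for nine cells is an ordinary list comprehension.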


Is there a way I can do this with one regular expression?

I doubt there is another question programmers have ever asked themselves that has resulted in as much eye-searing code as this one.

I've got to Nth the commenters above who recommend using a real HTML parser rather than a regex for this. You're just setting yourself (or an unlucky colleague) up for a maintenance nightmare later on.
posted by letourneau at 3:45 PM on May 4, 2009

