Regex crisis...
October 30, 2008 10:02 AM

Regular Expressions / PHP preg_replace problem. Can this be made to work...?

I've got a long string of text that will contain something like this...
  <container>aa<alpha>t</alpha>bb<beta>i</beta>dd</container>

I want to have a nice little bit of regex that plucks the 't' from alpha and the 'i' from beta. In itself okay. Except I want it to match correctly if it finds
  <container>aa<beta>i</beta>bb<alpha>t</alpha>dd</container>
and for it to get what it can if someone misses a field, like this...
  <container>aa<alpha>t</alpha>bb</container>

My basic knowledge of regexes suggests the following should have worked, but as you can guess from my question, it doesn't.
  @<container>([^<>]*)((<alpha>(.*?)</alpha>)|(<beta>(.*?)</beta>)|(^<>]*?))*([^<>]*)</container>@mi

Help...?
posted by twine42 to Computers & Internet (13 answers total)
 
I've approached similar problems by using XML parsers. Can you do something like that, or do you need to use regex?
posted by niles at 10:22 AM on October 30, 2008


You might be better off using Simple XML or one of the other batteries-included sorts of tools than mucking about with regex...
posted by brennen at 10:24 AM on October 30, 2008 [1 favorite]


Best answer: There's no fundamental reason this won't work. I don't know PHP syntax for regular expressions, but I'm assuming it's basically Perlish. Try this-

@<container>((<alpha>(.*?)</alpha>)|(<beta>(.*?)</beta>)|([^<>]*))*</container>@mi

Note that this won't work if there are elements other than alpha and beta inside of <container>.

In general, regular expressions aren't powerful enough to do very much with XML, for the same reason you can't use them to match arbitrary-length palindromes. You need a stack.

posted by qxntpqbbbqxl at 10:39 AM on October 30, 2008
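
A minimal sketch of how the pattern above could be tried from PHP with preg_match, using the three sample strings from the question. The function name extractFields is hypothetical; the group numbers (3 for alpha, 5 for beta) come from counting the opening parentheses in the pattern as written.

```php
<?php
// Pattern from the answer above; "@" is the delimiter, "m" and "i" are the original flags.
$pattern = '@<container>((<alpha>(.*?)</alpha>)|(<beta>(.*?)</beta>)|([^<>]*))*</container>@mi';

$samples = [
    '<container>aa<alpha>t</alpha>bb<beta>i</beta>dd</container>',
    '<container>aa<beta>i</beta>bb<alpha>t</alpha>dd</container>',
    '<container>aa<alpha>t</alpha>bb</container>',
];

// Hypothetical helper: returns the alpha and beta contents, or null if there is no match.
function extractFields(string $pattern, string $text): ?array
{
    if (preg_match($pattern, $text, $m) !== 1) {
        return null;
    }
    // Group 3 holds the alpha content, group 5 the beta content.
    // A group that never took part in the match comes back as an empty string or is absent.
    return ['alpha' => $m[3] ?? '', 'beta' => $m[5] ?? ''];
}

foreach ($samples as $sample) {
    var_dump(extractFields($pattern, $sample));
}
```

One thing to keep in mind: a capture group inside a repeated group only keeps the value from the last iteration in which it matched, so if a container held two alpha elements, only the second one's content would be left in group 3.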


Seconding a parser, but if you want to stick with regex you shouldn't have to account for the whole line and you shouldn't need to use alternation. I'd go with something like:
(<alpha>(.*?)</alpha>)?(<beta>(.*?)</beta>)?
Also not sure what you're trying to get out, or in what form. If you want to keep the values separate and predictable it may be better to split the checks for alpha and beta.
posted by rhizome at 10:51 AM on October 30, 2008


Does this have to be done in one regex, or can you scan your input twice, once for alpha and once for beta? Your regex(es) would be less complicated and you wouldn't have to worry about missing or extraneous fields in your input.
posted by xbonesgt at 10:52 AM on October 30, 2008
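
The two-pass idea could look roughly like this. It is a sketch only: firstTagContent is a hypothetical helper, the "s" flag is an assumption so that a field's content may span lines, and nothing here checks that the tags actually sit inside a <container>.

```php
<?php
// Hypothetical helper: content of the first <$tag>...</$tag> pair, or null if the tag is absent.
function firstTagContent(string $tag, string $text): ?string
{
    // preg_quote() escapes regex metacharacters in the tag name, plus the "@" delimiter.
    $quoted = preg_quote($tag, '@');
    $pattern = '@<' . $quoted . '>(.*?)</' . $quoted . '>@is';
    return preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
}

$text = '<container>aa<beta>i</beta>bb<alpha>t</alpha>dd</container>';

// One scan per field, so the order of the fields in the input does not matter.
$alpha = firstTagContent('alpha', $text);
$beta  = firstTagContent('beta', $text);
```

Each field is looked up independently, so a missing field simply comes back as null without affecting the other lookup.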


Download a tool like regexbuddy and play around with it. It makes regexes a breeze and it will create the code for almost any language.
posted by wongcorgi at 10:53 AM on October 30, 2008


I don't know PHP syntax for regular expressions, but I'm assuming it's basically Perlish.

As far as I know, the preg_* functions use PRCRE, which means that they're theoretically Perlish but you don't want to make too many assumptions.
posted by brennen at 10:55 AM on October 30, 2008


Argh. PCRE.
posted by brennen at 10:55 AM on October 30, 2008


Seriously, SimpleXML is the answer (unless the input just looks like xml but isn't).
posted by and hosted from Uranus at 11:00 AM on October 30, 2008
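
For comparison, a minimal sketch of the SimpleXML route. It assumes the SimpleXML extension is available and that the string being parsed is just the well-formed <container> fragment on its own, which may not hold if the fragment is buried in a longer string of non-XML text.

```php
<?php
$text = '<container>aa<alpha>t</alpha>bb</container>';

// simplexml_load_string() returns false if the input is not well-formed XML.
$xml = simplexml_load_string($text);
if ($xml === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

// Casting a child element to string gives its text; isset() reports whether the child exists.
$alpha = isset($xml->alpha) ? (string) $xml->alpha : null;
$beta  = isset($xml->beta)  ? (string) $xml->beta  : null;
```

Field order and missing fields need no special handling here, since the children are looked up by name.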


Response by poster: I've got a table with data in it and another with regexes in it. As the data gets thrown to the screen it gets hit by the regexes to format it sensibly and do any data manipulation I need.

So, yeah, I'd prefer to keep it as a regex if I can.

Oh, the aa, bb, cc was there in my test as a placeholder for potential whitespace stuff so I could see it in test preg_match.
posted by twine42 at 11:04 AM on October 30, 2008


How will you make sure you replace the right matches if the fields can be in any order, or do you just want to replace all of them with the same string or get rid of all their content? I have found testing regexes with a print_r of a preg_split can be very helpful.

And I guess the slashes should be escaped with backslashes \/ ?
posted by dnial at 11:34 AM on October 30, 2008
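
On the escaping question: in PHP's preg_* functions only the character chosen as the pattern delimiter needs a backslash, so with "@" as the delimiter the slashes can stay as they are. A small sketch comparing the two forms, with a print_r of the matches and of a preg_split as suggested above (the variable names are just for illustration):

```php
<?php
$text = 'aa<alpha>t</alpha>bb';

// With "@" as the delimiter, "/" is an ordinary character in the pattern.
preg_match('@<alpha>(.*?)</alpha>@i', $text, $withAt);

// With "/" as the delimiter, the "/" in the closing tag has to be escaped.
preg_match('/<alpha>(.*?)<\/alpha>/i', $text, $withSlash);

// Both calls fill the match array the same way: full match first, then group 1.
print_r($withAt);

// Splitting on the opening or closing alpha tag shows the pieces around and inside it.
$pieces = preg_split('@</?alpha>@i', $text);
print_r($pieces);
```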


Carefully reading your regex. I assume that instead of this

@<container>([^<>]*)((<alpha>(.*?)</alpha>)|(<beta>(.*?)</beta>)|(^<>]*?))*([^<>]*)</container>@mi

You meant to type something like this
@
<container>
  ([^<>]*)
  (
     (<alpha>(.*?)</alpha>)  |  (<beta>(.*?)</beta>)  |  ([^<>]*?)  # The opening square bracket was missing here!
  )*
  ([^<>]*)
</container>@mi
I've broken it up and annotated it as if I were using Perl's /x switch. I don't know PHP.
posted by I_pity_the_fool at 1:45 PM on October 30, 2008


What I've typed works in Regex Coach (another fine program). You can swap and delete the <alpha> and <beta> tags as well.

Also, do yourself a favour and use named captures and your equivalent of the /x switch if your dialect supports them. It'll probably save you some squinting time in 6 months.
posted by I_pity_the_fool at 1:48 PM on October 30, 2008
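
In PHP the equivalent of the /x switch is the "x" pattern modifier, and named captures can be written as (?<name>...). A sketch of the pattern rewritten that way follows. It is an adaptation, not the pattern from the comments above: the outer groups are non-capturing, and the filler alternative uses + instead of * so that it never matches an empty string.

```php
<?php
// "x" allows whitespace and "#" comments inside the pattern; "i" ignores case.
// Note that the "@" delimiter must not appear inside the comments.
$pattern = '@
    <container>
    (?:
        <alpha>(?<alpha>.*?)</alpha>   # content of alpha
      | <beta>(?<beta>.*?)</beta>      # content of beta
      | [^<>]+                         # any other text between the tags
    )*
    </container>
@xi';

$text = '<container>aa<beta>i</beta>bb<alpha>t</alpha>dd</container>';

$alpha = $beta = null;
if (preg_match($pattern, $text, $m) === 1) {
    // Named groups show up under their names; a group that did not match may be absent or empty.
    $alpha = $m['alpha'] ?? '';
    $beta  = $m['beta'] ?? '';
}
```

Reading the fields by name means the code does not break if a group is later added to or removed from the pattern.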

