Regex Alms for the Perl-less?
December 28, 2007 2:12 AM

Help me composite some regex. (That phrasing makes this question sound way less nerdy than it is.)

Here's a regex to find some stuff between square brackets:
/\[[^\]]+\]/

Here's one that finds something like Alt: or alt. or Alternative: or alternate or Alternate: or Alternative or Alt.: or you get the idea [note it ends with \s+; it is important for my application that the "Alt___ " I'm testing for has white space at the end, and that I test for it. In the final answer we can test for that word boundary any way we like, we just need to make sure that we do.]:
/(A|a)lt(\.|ernat(e|ive))?:?\s+/

So what I need is a regular expression for "stuff between square brackets where the first thing inside the brackets will NOT match the second regex." Or "stuff inside square brackets that begins with anything BUT alt or Alt. or Alternate or alternative: or alt.: or etc. etc."

I feel like this should be easy, but I never bothered to totally and completely grok regex, and obviously I'm hurting now because of it. I'd very much appreciate any help anyone could give, and in exchange you'll get co-author credit for the amazing piece of software that this thing will ultimately be a part of! ;-)
posted by ChasFile to Computers & Internet (13 answers total)
You can (at least if it's like sed's regexes) rewrite the first one as:
s/\[\([^]]\+\)\]/\1/

And this will return anything within the parens (you can also use multiple sets of escaped parens, and reference them as \1, \2, \3). Then just pass it on to the next bit.

There's probably a way to chain them, too, but I think it's more readable to do one thing at a time. Just my personal preference.
posted by spaceman_spiff at 2:23 AM on December 28, 2007


Never mind, I missed the "not the second regex" part.
Go ahead and revoke my geek card now.
posted by spaceman_spiff at 2:26 AM on December 28, 2007


It can be done with the (?!...) syntax - google "negative lookahead assertion".
posted by Canard de Vasco at 2:36 AM on December 28, 2007


Which regex dialect are you using? sed? egrep? perl? VBScript? Something else?
posted by flabdablet at 3:23 AM on December 28, 2007


Looks like perl from the tags.
posted by beerbajay at 3:44 AM on December 28, 2007


"Which regex dialect are you using?"

Assume perl, but if you can get it going in one of the others, that's fine, too. I can translate it easily enough. Mostly right now I'm just looking for a way out of the woods; I can worry about trimming the hedges and edging the paths later.
posted by ChasFile at 4:13 AM on December 28, 2007


What Canard de Vasco said:
 /\[(?![Aa]lt(?:\.|ernat(?:e|ive))?:?\s+)[^\]]+\]/ 

posted by duckstab at 5:25 AM on December 28, 2007
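
For anyone who wants to poke at that pattern outside Perl, here is a minimal sketch in Python, whose re module accepts the same (?!...) lookahead syntax. The pattern is copied unchanged from the answer above; the helper name first_bracket and the last sample string are made up for illustration.

```python
import re

# The pattern from the answer above: "[" that is NOT followed by an
# Alt-style label plus whitespace, then the bracket contents, then "]".
NOT_ALT = re.compile(r'\[(?![Aa]lt(?:\.|ernat(?:e|ive))?:?\s+)[^\]]+\]')

def first_bracket(text):
    """Return the first non-Alt bracketed chunk in text, or None (hypothetical helper)."""
    m = NOT_ALT.search(text)
    return m.group(0) if m else None

samples = ["foo", "[bar]", "[AltBar]", "[Alt: Foo]",
           "[Alternative: bar]", "[alt. x] [note]"]
results = {s: first_bracket(s) for s in samples}
for s, r in results.items():
    print(f"{s!r:24} -> {r!r}")
```

The lookahead only peeks at what follows the "[" without consuming it, so [^\]]+ still starts matching right after the bracket. That is why "[AltBar]" is still found: there is no whitespace after "Alt", so the lookahead fails and the negative assertion passes.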


Seconding Canard de Vasco; the perlre manual has a pretty good explanation. Mind the distinction between "negative lookahead" and "lookbehind"; there's not much magic about the rest.


themel@sophokles:~$ perl -npe '$_ = /\[(?!(A|a)lt(\.|ernat(e|ive))?:?\s+)[^\]]+\]/ ?
"MATCH\n" : "NO MATCH\n";
'
foo
NO MATCH
[bar]
MATCH
[AltBar]
MATCH
[Alt: Foo]
NO MATCH
[Alternative: bar]
NO MATCH

posted by themel at 5:26 AM on December 28, 2007
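
To make the lookahead/lookbehind distinction concrete, here is a small sketch in Python, which spells the two assertions the same way Perl does: (?!...) and (?<!...). The two patterns and the sample strings are invented for illustration and are simpler than the one the question needs.

```python
import re

# Negative lookahead: a "[" that is NOT followed by "Alt".
ahead = re.compile(r'\[(?!Alt)')
# Negative lookbehind: a "]" that is NOT preceded by "Alt".
behind = re.compile(r'(?<!Alt)\]')

samples = ("[Alt x]", "[x Alt]")
ahead_hits = [bool(ahead.search(s)) for s in samples]
behind_hits = [bool(behind.search(s)) for s in samples]
print(ahead_hits, behind_hits)
```

The lookahead inspects the text after the current position and the lookbehind inspects the text before it, so the two patterns disagree on both samples. For "the first thing inside the brackets", the lookahead right after the "[" is the one that applies. Note that Python's re requires a fixed-width pattern inside a lookbehind.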


Do you need the second part to be not exactly like the first, though it may be a slightly different variant? Check this out. I added line breaks to break it up, so it needs the /x flag.

/
\[
(
[Aa]lt(?:\.|ernat(?:e|ive))?:
)
\]
\s+
\[(?!\1)[^\]]+\]
/x

That finds a [Alt:] spaces [somethingElseNotExactlyTheSame]

If you don't care whether the second part is exactly the same, but it must not be ANY variant of the "alt" word, then duplicate that expression in place of \1.
posted by cmiller at 6:00 AM on December 28, 2007
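
A rough Python sketch of the backreference idea, assuming the second bracket is meant to hold some text that does not start with whatever the first bracket captured. re.VERBOSE plays the role of Perl's /x here; the name PAIR and the test strings are made up for illustration.

```python
import re

PAIR = re.compile(r"""
    \[ ( [Aa]lt (?: \. | ernat (?: e | ive ) )? : ) \]  # group 1: label inside the first brackets
    \s+
    \[ (?!\1) [^\]]+ \]   # second bracket must not start with the group 1 text
""", re.VERBOSE)

tests = ["[Alt:] [somethingElse]", "[Alt:] [Alt: again]", "[Alt:] [alt: lower]"]
matched = [bool(PAIR.search(t)) for t in tests]
print(matched)
```

Because \1 is the literal text captured earlier, the check is case-sensitive: "[Alt:] [alt: lower]" still matches, which is the "not exactly the same" behavior rather than "not any Alt variant".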


Looks like people have already answered the post, but as a side-note I have to recommend The Regex Coach, which is essentially an interactive regex interpreter. Whenever I need to figure out a complicated regex for a piece of Perl I'm working on, I break this out and use it to come up with the regex and a bunch of test cases. Very helpful.
posted by kxr at 6:12 AM on December 28, 2007


There's also regex-tool, which supports interactive regex building within Emacs. It supports both Emacs-style and Perl-style regular expressions.
posted by harmfulray at 8:27 AM on December 28, 2007


There is also nothing wrong with a "get then filter" design pattern in this case. I find in most cases if I want to exclude X now, I end up excluding Y and Z down the road.
import re
alts = re.compile(r'(A|a)lt(\.|ernat(e|ive))?:?\s+')
brackets = re.compile(r'\[[^\]]+\]')
# match(x, 1) tests just after the "[", so only a leading Alt label excludes x
kept = [x for x in brackets.findall(r'[hh][Alternative ]') if not alts.match(x, 1)]

posted by KirkJobSluder at 9:00 AM on December 28, 2007


I would caution you against putting a regex in production code that is so complex you had to ask about it here. Anyone else who maintains the program will likely have to ask about it, or worse yet, misunderstand or break it. You might even forget how it works when you look at this code again in a few years.

While several people have given regexes that will work, I would encourage you to write the code for clarity to human readers. Unless you're in a tight loop in some high-performance code, an if/else statement and a couple of simpler regexes may be better in the long run.
posted by tomwheeler at 1:38 PM on December 29, 2007
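
One way the "if/else plus a couple of simpler regexes" suggestion might look, sketched in Python. The function name non_alt_brackets and the sample input are hypothetical; the two patterns are the ones from the question, with a capture group added around the bracket contents.

```python
import re

BRACKETED = re.compile(r'\[([^\]]+)\]')   # group 1: text inside the brackets
ALT_LABEL = re.compile(r'[Aa]lt(\.|ernat(e|ive))?:?\s+')

def non_alt_brackets(text):
    """Return bracketed chunks whose contents do not start with an Alt label."""
    kept = []
    for m in BRACKETED.finditer(text):
        inside = m.group(1)
        if ALT_LABEL.match(inside):   # match() only succeeds at the start of `inside`
            continue                  # skip "[Alt: ...]"-style chunks
        kept.append(m.group(0))
    return kept

found = non_alt_brackets("[hh][Alternative ][see alt text]")
print(found)
```

Each regex does one job, and the exclusion rule lives in an ordinary if statement, so adding a second exclusion later means adding a line of code rather than reworking a lookahead.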

