Comments on: Regex Alms for the Perl-less?

Question: Regex Alms for the Perl-less?

ChasFile — Fri, 28 Dec 2007 02:12:28 -0800

Help me composite some regex. (That phrasing makes this question sound way less nerdy than it is)

Here's a regex to find some stuff between square brackets:
/\[[^\]]+\]/

Here's one that finds something like Alt: or alt. or Alternative: or alternate or Alternate: or Alternative or Alt.: or you get the idea [note it ends with \s+; it is important for my application that the "Alt___ " I'm testing for has white space at the end, and that I test for it. In the final answer we can test for that word boundary any way we like, we just need to make sure that we do.]:
/(A|a)lt(\.|ernat(e|ive))?:?\s+/

So what I need is a regular expression for "stuff between square brackets where the first thing inside the brackets will NOT match the second regex." Or "stuff inside square brackets that begins with anything BUT alt or Alt. or Alternate or alternative: or alt.: etc. etc."

I feel like this should be easy, but I never bothered to totally and completely grok regex, and obviously I'm hurting now because of it. I'd very much appreciate any help anyone could give, and in exchange you'll get co-author credit for the amazing piece of software that this thing will ultimately be a part of! ;-)

By: spaceman_spiff

spaceman_spiff — Fri, 28 Dec 2007 02:23:16 -0800

You can (at least if it's like sed's regexes) rewrite the first one as:
s/\[\([^]]\{1,\}\)\]/\1/

And this will return anything within the parens (you can also use multiple sets of escaped parens, and reference them as \1, \2, \3). Then just pass it on to the next bit.

There's probably a way to chain them, too, but I think it's more readable to do one thing at a time. Just my personal preference.

By: spaceman_spiff

spaceman_spiff — Fri, 28 Dec 2007 02:26:24 -0800

Never mind, I missed the "not the second regex" part.
Go ahead and revoke my geek card now.

By: Canard de Vasco

Canard de Vasco — Fri, 28 Dec 2007 02:36:59 -0800

It can be done with the (?!...) syntax - google "negative lookahead assertion".

By: flabdablet

flabdablet — Fri, 28 Dec 2007 03:23:36 -0800

Which regex dialect are you using? sed? egrep? perl? VBScript? Something else?

By: beerbajay

beerbajay — Fri, 28 Dec 2007 03:44:57 -0800

Looks like perl from the tags.

By: ChasFile

ChasFile — Fri, 28 Dec 2007 04:13:24 -0800

Which regex dialect are you using?

Assume perl, but if you can get it going in one of the others, that's fine, too. I can translate it easily enough. Mostly right now I'm just looking for a way out of the woods; I can worry about trimming the hedges and edging the paths later.

By: duckstab

duckstab — Fri, 28 Dec 2007 05:25:08 -0800

What Canard de Vasco said:

 /\[(?![Aa]lt(?:\.|ernat(?:e|ive))?:?\s+)[^\]]+\]/
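A quick way to sanity-check a pattern like this is to run it over a few sample strings. The sketch below uses Python's `re` module rather than Perl, on the assumption that the lookahead syntax behaves the same for this pattern; the sample strings and the name `NOT_ALT` are made up for illustration.

```python
import re

# Same pattern as above: "[" not followed by an Alt-style prefix plus
# whitespace, then one or more non-"]" characters, then "]".
NOT_ALT = re.compile(r'\[(?![Aa]lt(?:\.|ernat(?:e|ive))?:?\s+)[^\]]+\]')

# Hypothetical test inputs, not taken from the asker's real data.
samples = ['foo', '[bar]', '[AltBar]', '[Alt: Foo]', '[alternate thing]', '[Alt.: x]']

# Map each sample to True (pattern found) or False (not found).
results = {s: bool(NOT_ALT.search(s)) for s in samples}

for text, matched in results.items():
    print(text, '->', 'MATCH' if matched else 'NO MATCH')
```

On these samples only `[bar]` and `[AltBar]` should come out as MATCH. `[AltBar]` passes because there is no whitespace after the "Alt", which is the word-boundary behavior the question asked for.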

By: themel

themel — Fri, 28 Dec 2007 05:26:11 -0800

Seconding Canard de Vasco, the perlre manual has a pretty good explanation. Mind the distinction between "negative lookahead" and "lookbehind", not much magic about the rest.



themel@sophokles:~$ perl -npe '$_ = /\[(?!(A|a)lt(\.|ernat(e|ive))?:?\s+)[^\]]+\]/ ? "MATCH\n" : "NO MATCH\n";'
foo
NO MATCH
[bar]
MATCH
[AltBar]
MATCH
[Alt: Foo]
NO MATCH
[Alternative: bar]
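For anyone fuzzy on the lookahead versus lookbehind distinction, here is a small sketch in Python (whose `re` module uses the same `(?!...)` and `(?<!...)` syntax). The two patterns and the sample strings are simplified illustrations, not the pattern from the question.

```python
import re

# Negative lookahead: right after "[", what comes NEXT must not be "Alt".
ahead = re.compile(r'\[(?!Alt)\w+\]')

# Negative lookbehind: right before "]", what came just BEFORE must not be "Alt".
behind = re.compile(r'\[\w+(?<!Alt)\]')

samples = ['[AltFoo]', '[FooAlt]']
ahead_hits = [s for s in samples if ahead.search(s)]
behind_hits = [s for s in samples if behind.search(s)]

print('lookahead keeps: ', ahead_hits)
print('lookbehind keeps:', behind_hits)
```

The lookahead version rejects brackets that start with "Alt", while the lookbehind version rejects brackets that end with it. The question is about what the bracket contents begin with, so lookahead is the one that applies.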

By: cmiller

cmiller — Fri, 28 Dec 2007 06:00:55 -0800

Do you require that the second part not be exactly like the first, while still allowing it to be a slightly different variant? Check this out. I added carriage returns to break it up.

/
\[
(
[Aa]lt(?:\.|ernat(?:e|ive))?:
)
\]
\s+
\[(?!\1)[^\]]+\]
/

That finds a [Alt:] spaces [somethingElseNotExactlyTheSame]

If you don't care about the second part being exactly the same, but must not be ANY variant of the "alt" word, then duplicate that expression in place of \1.

By: kxr

kxr — Fri, 28 Dec 2007 06:12:06 -0800

Looks like people have already answered the post, but as a side-note I have to recommend The Regex Coach, which is essentially an interactive regex interpreter. Whenever I need to figure out a complicated regex for a piece of Perl I'm working on, I break this out and use it to come up with the regex and a bunch of test cases. Very helpful.

By: harmfulray

harmfulray — Fri, 28 Dec 2007 08:27:37 -0800

There's also regex-tool, which supports interactive regex building within Emacs. It supports both Emacs-style and Perl-style regular expressions.

By: KirkJobSluder

KirkJobSluder — Fri, 28 Dec 2007 09:00:40 -0800

There is also nothing wrong with a "get then filter" design pattern in this case. I find in most cases if I want to exclude X now, I end up excluding Y and Z down the road.

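import re

alts = re.compile(r'(A|a)lt(\.|ernat(e|ive))?:?\s+')
brackets = re.compile(r'\[[^\]]+\]')
# findall returns each bracketed chunk with its brackets; match(x, 1) tests right after the "["
matches = [x for x in brackets.findall(r'[hh][Alternative ]') if not alts.match(x, 1)]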

By: tomwheeler

tomwheeler — Sat, 29 Dec 2007 13:38:50 -0800

I would caution you against putting a regex in production code that is so complex you had to ask about it here. Anyone else who maintains the program will likely have to ask about it too, or worse yet, misunderstand or break it. You might even forget how it works when you look at this code again in a few years.

While several people have given regexes that will work, I will encourage you to think about writing the code for clarity to humans. Unless you're in a tight loop for some high-performance code, an if/else statement and a couple of simpler regexes may be better in the long run.
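As a rough sketch of that approach (in Python, with made-up names `BRACKETED`, `ALT_PREFIX`, and `non_alt_brackets`, and assuming the same Alt variants as the question), two small regexes and a plain if/else could look like this:

```python
import re

# Step 1: find every [...] chunk and capture what is inside the brackets.
BRACKETED = re.compile(r'\[([^\]]+)\]')

# Step 2: an Alt-style label (alt, Alt., Alternate:, alternative, ...)
# followed by at least one whitespace character.
ALT_PREFIX = re.compile(r'[Aa]lt(\.|ernate|ernative)?:?\s')


def non_alt_brackets(text):
    """Return bracketed chunks whose contents do not start with an Alt label."""
    kept = []
    for found in BRACKETED.finditer(text):
        inner = found.group(1)
        if ALT_PREFIX.match(inner):
            # Contents begin with an Alt-style label: leave this chunk out.
            continue
        else:
            kept.append(found.group(0))
    return kept


print(non_alt_brackets('[bar] [Alt: Foo] [AltBar] [alternative x]'))
```

Each regex can be read and tested on its own, and adding another excluded prefix later means adding one more check in the loop rather than reworking a single dense pattern.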