Perl regular expression question
December 21, 2005 2:04 PM

Perl regular expression question inside. Trying to parse a list of items...

Ok, I have a file listing things formatted like this:

Items foobar:
Items foobaz:

I'm trying to use a regular expression to determine whether a given item comes after foobar or after foobaz. I'm doing something like this:
$variable = "b3";
$text_of_file =~ /Items (\S+?):.*?$variable/;
print "$1\n";

I figured that adding a ? after the * to make it non-greedy would mean that it would print "foobaz", but unfortunately it's printing "foobar".

Can someone suggest a better way to do this? It occurred to me that I could split the list up into sections using something like:
@sections = split(/foob\S\S/, $text_of_file);
but that seemed like a lame hack, and it seems like you should be able to easily do this using a regex.
posted by pornucopia to Computers & Internet (8 answers total)

Best answer: In essence, you're not telling Perl what you think you're telling it. You say "Find me an Items header, then whatever text you want (the .*), then my variable", which it does. What you really want is more like "Find me an Items header, then some text that is not an Items header, then my variable".

For your simple example, changing .* to [^I]* does the trick, but I don't know how to extend this to negative string matching offhand. I'd probably just look for the variable, then take $PREMATCH and look for the last items header in that, but I remember reading that this is a bad idea, performance-wise.
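One way to spell "some text that is not an Items header" is a negative lookahead applied at every character. This is only a sketch: it assumes one item per line under each header, and both the sample data and the section_for() helper are made up for illustration.

```perl
use strict;
use warnings;

# Assumed sample data: one item per line under each "Items" header.
my $text_of_file = "Items foobar:\na1\na2\na3\nItems foobaz:\nb1\nb2\nb3\n";

# Hypothetical helper: returns the header an item is listed under, or undef.
sub section_for {
    my ($text, $variable) = @_;
    # (?:(?!Items ).)* consumes one character at a time, but only while
    # the text at that position does not start another "Items " header.
    # ^...$ with /m makes the item match a whole line, not a substring.
    return $text =~ /Items (\S+?):(?:(?!Items ).)*?^\Q$variable\E$/ms
        ? $1
        : undef;
}

print section_for($text_of_file, 'b3'), "\n";   # foobaz
print section_for($text_of_file, 'a2'), "\n";   # foobar
```

Because the middle part can never cross a header, a match that starts at the first header cannot reach b3, so the engine has to start at the second one.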

While you call the split approach a lame hack, I wouldn't really dismiss it - it may well be faster than the complicated regex approach, and maintaining such code is mostly easier than peeling apart giant monster regexes.

If you care about performance, I recommend some experimenting to get real numbers, though it's probably not going to save a program that reads its data from text files line by line.
posted by themel at 2:28 PM on December 21, 2005

Couple things here:

1) You have newlines between the items? I'm surprised this works at all without the 's' modifier.

2) You really should put \Q$variable\E in there to prevent the value of $variable from polluting your regex (in Perl, the interpolated value is itself interpreted as part of the pattern).

3) You're misunderstanding how greedy works. Your regex matches the first Items line it finds, and then matches as few characters as it needs to reach $variable. ".*?" affects how far that atom stretches and how the next atom matches, not where the previous ones matched.
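A small sketch of why point 2 matters. The item name 'b.3' is an assumption, picked only because it contains a regex metacharacter:

```perl
use strict;
use warnings;

my $variable = 'b.3';    # assumed item name containing a metacharacter

# Unquoted, the "." acts as a wildcard, so an unrelated string matches.
my $unquoted = 'bx3' =~ /^$variable$/     ? 1 : 0;

# With \Q...\E the value is matched literally.
my $quoted   = 'bx3' =~ /^\Q$variable\E$/ ? 1 : 0;
my $literal  = 'b.3' =~ /^\Q$variable\E$/ ? 1 : 0;

print "unquoted=$unquoted quoted=$quoted literal=$literal\n";
# unquoted=1 quoted=0 literal=1
```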

I'll think about it some more, but this is really a problem for a state machine.
posted by sbutler at 2:34 PM on December 21, 2005

b3 comes after both foobar and foobaz. Once you recognize this, you will generate a working regex, something like themel's.

There's a bazillion ways to skin this one; one way might be to scan all lines, creating strings like

Item foobar: a1 a2 a3 a4
Item foobaz: b1 b2 b3

then using grep() to find associations. You can get fancier using dictionaries and lists.
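In Perl the "dictionary" would be a hash. A rough sketch of that variant, assuming one item per line and headers that look like "Items foobar:" or "Item foobar:"; the sample data and the %section_of name are made up here:

```perl
use strict;
use warnings;

# Assumed sample data: one item per line under each header.
my $text_of_file = "Items foobar:\na1\na2\na3\nItems foobaz:\nb1\nb2\nb3\n";

my (%section_of, $section);
for my $line (split /\n/, $text_of_file) {
    if ($line =~ /^Items?\s+(\S+):\s*$/) {
        # Header line: remember the current section name.
        $section = $1;
    } elsif (defined $section && $line =~ /^\s*(\S+)\s*$/) {
        # Item line: record which section it was listed under.
        $section_of{$1} = $section;
    }
}

print "b3 is under $section_of{b3}\n";   # b3 is under foobaz
```

Once the hash is built, every further lookup is a single hash access, which helps if many items need checking against the same file.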

The best way to do it depends on the larger problem you're trying to solve.
posted by ldenneau at 2:40 PM on December 21, 2005

Response by poster: The larger problem: I'm really just trying to find the simplest way to determine whether a given variable (a2, b3, whatever) is listed under foobar, or under foobaz. The list might expand in the future to include other headings, but they'll all be formatted the same. "Item foobum:"
posted by pornucopia at 2:50 PM on December 21, 2005

Best answer:
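my $section;
my $variable = 'b3';
my $found = 0;
foreach (split /\n/, $text_of_file) {
    if (/^Items?\s+(.+?):$/) {
        # Header line ("Item foo:" or "Items foo:"): remember the section.
        $section = $1;
    } elsif ($_ eq $variable) {
        # Found the item; $section holds the header it is listed under.
        $found = 1;
        last;
    }
}
print "$section\n" if $found;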
This is the best way to solve this problem.
posted by sbutler at 2:56 PM on December 21, 2005

perl -lane '$section = $F[1] if $F[0] eq "Items"; print $section if $_ eq "b3"'
posted by nicwolff at 4:14 PM on December 21, 2005

Somewhat simpler...
Change your :

Items (\S+?):.*?$variable

to :

.*Items (\S+?):.*?$variable

This makes it greedily consume all the string up until the last "section".

This is somewhat more general than themel's suggestion; however, I cannot speak to what the performance difference would be.
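A quick sketch of the leading .* version. It assumes the /s modifier (so that . also matches newlines) and uses made-up sample data with one item per line; the %under hash is just for illustration:

```perl
use strict;
use warnings;

# Assumed sample data: one item per line under each header.
my $text_of_file = "Items foobar:\na1\na2\na3\nItems foobaz:\nb1\nb2\nb3\n";

my %under;
for my $variable (qw(a2 b3)) {
    # The greedy .* first tries the last "Items" header and backtracks
    # toward earlier ones until the item can be found after the header.
    if ($text_of_file =~ /.*Items (\S+?):.*?\Q$variable\E/s) {
        $under{$variable} = $1;
        print "$variable is under $1\n";
    }
}
# a2 is under foobar
# b3 is under foobaz
```

If the same item were listed under more than one header, this pattern would report the last such header.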

Also, I highly recommend regex-coach for all regex work. It's an interactive graphical tool that makes it very easy to see what is going on.
posted by ill3 at 6:35 PM on December 21, 2005

I use split() constantly and successfully to help parse the contents of big text files of all kinds, especially HTML files. Reading files a line at a time is for suckers. Slurp it in and split() it until it tells you what you want to know.

"but that seemed like a lame hack, and it seems like you should be able to easily do this using a regex."

Remember, when you use split(), you are using a regex, just in a useful shorthand way. When you combine grep(), map(), and split(), there's a lot you can do without bothering to create laborious loops or assign a lot of temporary variables.

Here's a one-liner to solve the original problem:

my ($label) = split(/:/, (grep { /$variable/ } split(/Items /, $text_of_file))[0]);

$label should contain "foobaz".
posted by staggernation at 8:49 PM on December 21, 2005
