Perl regex hell
September 17, 2010 2:38 PM

I have been hammering Google for about three hours now trying to find the answer to what I expected to be a simple question: WTF is up with regular expression backreferences in Perl? I cannot make this (simple) find-replace work.

Trying to automate some file processing, stripping out junk in text output files from a data collection program and turning them into .csv files.

I was doing this via an AppleScript, calling TextWrangler to do the grunt work using grep regex find/replace, and it worked well - but it was really slow once I compiled the AppleScript, so I am trying to use a call to a shell script instead to speed things up.

The code in question:
sub replaceString {
    my ($search, $replace) = @_;
    if( s/$search/$replace/ig ) {
        print;
    }
    else {
        print;
    }
}


When I pass it text as the search and replace arguments (ie, find "foo" and replace with "bar") it works. When I pass it arguments using backreferences, it doesn't work, even though the same argument in TextWrangler works just fine.

Specific example: I need it to strip an initial 0 off of some timestamps in the file. If I use "\ +0(\d\d:)" and "\1" for my find and replace values, the zero is replaced by "\1" - literally. "045:" is changed to "\145:". It doesn't work using dollar signs either. I've spent most of my afternoon trying to figure out how to do this, but every damn result I find in searching just tells me how to use regex, not how to pass regex group elements from one match to the next. Perl apparently just forgets all regex values the second it starts interpreting a new match?

I can't figure out why a program designed to work with text is so obstinately stupid when it comes to something as simple as a find-replace using pattern matching. There must be a really simple way to do this.

(Notes: the subroutine will be called to replace a long string of matches, sequentially, so I need to be able to keep $search and $replace as undefined args passed by the main routine. In the end I need to be able to package this as a bundled app that anyone in my lab can run on the lab Macs [the reason I compiled the script], so I can't get too crazy with how this thing works.)
posted by caution live frogs to Computers & Internet (14 answers total)
 
I'm taking a stab in the dark here, and guessing the \1 is only a valid back reference when naked, and becomes literal text when passed in as part of a text string. I can't elaborate on why that would be, but it fits your observations and seems somewhat reasonable.

If that's the case, you can probably fix it by creating a literal text string of your s/// construct by using your passed string values, and then running it through eval().
posted by devbrain at 2:48 PM on September 17, 2010


At least in Perl, you need to use $1 instead of \1.
posted by jozxyqk at 2:49 PM on September 17, 2010 [1 favorite]


Use '$1' instead of '\1' and do s/$search/$replace/eegi instead.

Also, I hope your replaceString is a little more complex. Otherwise it would be just as easy to do the s/// instead of making the function call.
posted by sbutler at 2:50 PM on September 17, 2010


Best answer: Specifically, this works for me:
sub replaceString {
    my ($s, $r) = @_;

    if (s/$s/$r/eeig) {
        print;
    } else {
        print;
    }
}

$_ = "045:\n";
replaceString qr/0(\d\d:)/, '\1';

posted by sbutler at 2:51 PM on September 17, 2010


Arg... not \1. $1. $1. $1.
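For reference, here is a minimal sketch of the snippet above with that correction applied. The only changes are the '$1' argument, the strict/warnings lines, and returning the value instead of printing inside the sub (that last one is just my assumption, to make the result easy to check).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply a substitution to $_ and return the result.
# $s is the pattern; $r is a string of Perl code (here '$1') that
# /ee evaluates after each match, so $1 refers to the current capture.
sub replaceString {
    my ($s, $r) = @_;
    s/$s/$r/eeig;
    return $_;
}

$_ = "045:\n";
my $result = replaceString(qr/0(\d\d:)/, '$1');
print $result;   # prints "45:" followed by a newline
```

Because the replacement is evaluated as code, anything beyond a bare capture variable has to be a valid Perl expression, e.g. '"$1 foo"' with the inner double quotes.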
posted by sbutler at 2:55 PM on September 17, 2010


I see some other folks have chimed in while I was writing a small snippet to demonstrate that eval works.

Other changes -- I've changed your "\ " to a "\s" which I think is more readable. You might be better yet with a \b to match a word boundary - do these leading zeros never occur at the start of a line, but always after whitespace? Your regex as given will strip the leading zero AND leading whitespace -- if that's not what you want then use \b.

The \1 is better written as $1 (per perl's warnings).

There's no need for an if/else when both branches just print.


#!/usr/bin/perl -w

sub replaceString {
    my ($search, $replace) = @_;

    my $str = qq! s/$search/$replace/ig; !;
    eval $str;
    print;
}

while (<>) {
    replaceString('\s+0(\d\d:)', '$1');
}


posted by devbrain at 3:01 PM on September 17, 2010


Also, I'm not sure I understand how you expect things to work here:

I've spent most of my afternoon trying to figure out how to do this, but every damn result I find in searching just tells me how to use regex, not how to pass regex group elements from one match to the next. Perl apparently just forgets all regex values the second it starts interpreting a new match?

If you mean inside of $replace, then you're misunderstanding the terminology. What's in there is not another match, it's a replacement, and $1/$2/etc are certainly valid in this context.

If you mean at the start of another s/// or m//, then yes, of course they get reset. That's the only reasonable thing to do. If you want to save $1/$2/etc from a previous match then you need to do that yourself. It's pretty easy:
push @matches, [$&,$1,$2,$3,$4,$5,$6,$7,$8,$9];
Then the first capture from the previous match is $matches[-1][1];
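
A rough sketch of that bookkeeping, in case it helps. The sample lines and the pattern are made up for illustration; the whole matched text comes from $& (not $0, which holds the program name), and the captures are copied via a list-context match rather than listing $1 through $9.

```perl
use strict;
use warnings;

my @matches;   # one array ref per successful match

for my $line ("045: first", "012: second") {
    # In list context a match returns its captures, so they can be
    # copied before the next match resets $1, $2, ...
    if (my @captures = $line =~ /0(\d\d):\s+(\w+)/) {
        push @matches, [ $&, @captures ];   # $& is the whole matched text
    }
}

# First capture of the most recent match, then of the one before it.
print "$matches[-1][1]\n";   # prints 12
print "$matches[-2][1]\n";   # prints 45
```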
posted by sbutler at 3:02 PM on September 17, 2010


Best answer: @devbrain while that might work, I have to disagree that it is a good solution. eval on strings should be used sparingly and as restrictively as possible. And in this case, the perl developers added an explicit option to s/// to solve the problem. From perlop:
e   Evaluate the right side as an expression.
ee  Evaluate the right side as a string then eval the result
Clearly they intended for us to use 'ee' in this case.
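To make the difference concrete, here is a small sketch running the same substitution with no modifier, with e, and with ee. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $replace = '$1';      # replacement arrives as plain text
my $input   = "045:";

# No modifier: $replace is interpolated once, leaving the literal text $1.
(my $plain = $input) =~ s/0(\d\d:)/$replace/;

# /e: the right side is the expression $replace, whose value is still '$1'.
(my $once = $input) =~ s/0(\d\d:)/$replace/e;

# /ee: that value is then eval'd as code, which yields the capture.
(my $twice = $input) =~ s/0(\d\d:)/$replace/ee;

print "none: $plain\n";   # none: $1
print "e:    $once\n";    # e:    $1
print "ee:   $twice\n";   # ee:   45:
```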
posted by sbutler at 3:07 PM on September 17, 2010


If you're just stripping, you can use positive lookahead:
replaceString('\s+0(?=\d\d:)', '');
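
A quick sketch of what that pattern does on a made-up sample line, plus a variant that keeps the whitespace (whether you want the whitespace kept is an assumption on my part):

```perl
use strict;
use warnings;

my $line = "elapsed   045:12";

# (?=\d\d:) asserts that two digits and a colon follow, without consuming
# them, so only the whitespace and the leading zero are deleted.
(my $stripped = $line) =~ s/\s+0(?=\d\d:)//g;

# Variant: keep the whitespace by checking for it in a lookbehind
# instead of matching (and therefore deleting) it.
(my $kept = $line) =~ s/(?<=\s)0(?=\d\d:)//g;

print "[$stripped]\n";   # [elapsed45:12]
print "[$kept]\n";       # [elapsed   45:12]
```

Since nothing is captured, the replacement is an empty string and no e or ee modifier is needed.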

posted by rhizome at 3:11 PM on September 17, 2010


FWIW, within a pattern, use \1 as a backreference to an earlier part of the same pattern, to match a duplicate of something that an earlier group matched. Within a replacement or just plain ol' Perl code, use $1, which is a variable holding what a group matched. Subtle difference (and muddied by the fact that sed uses \1 for both).
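
A small sketch showing both forms in one substitution (the sample text is made up):

```perl
use strict;
use warnings;

my $text = "the the quick brown fox";

# Inside the pattern, \1 means "the same text group 1 just matched",
# so this finds a doubled word.
# In the replacement, $1 is the variable holding that captured text.
(my $fixed = $text) =~ s/\b(\w+) \1\b/$1/;

print "$fixed\n";   # prints: the quick brown fox
```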

As for why it's not doing what you want, the answer is the 'ee' suffix that sbutler mentions (devbrain's string eval gets at the same thing). It would be really really annoying if Perl did what you wanted all the time; you normally don't want the contents of text variables to be re-parsed as Perl code repeatedly. If there happens to be a $ in your replacement text you don't want that to get interpolated as a perl variable. Usually.
$a = 'foo $1 bar';
$b = qr/"(.*)"/;
$c = 'I say "ok"';

print "Before: [$c]\n";
$c =~ s/$b/qq(qq($a))/ee;
print "After: [$c]\n";
The 'e' suffix evaluates as perl code, but what you want to do is two rounds of string interpolation, so you wrap the RHS in two doublequote operators.

An IMHO less weird and more perlish way to do this would be to pass in a code reference to compute the substitution string:
$a = sub { "foo $1 bar" };
$b = qr/"(.*)"/;
$c = 'I say "ok"';

print "Before: [$c]\n";
$c =~ s/$b/&$a/e;
print "After: [$c]\n";
The big benefit of doing this is that you don't make any assumptions about the content of the replacement string— in the first example, if $a had a right-paren in it, it could mess up the second eval. In the second example, there's no multiple-eval weirdness going on, just a subroutine call.
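Applied to the original question, that might look something like the sketch below. This replaceString signature (text, pattern, code ref, returning the new text) is a hypothetical variant, not the poster's actual sub.

```perl
use strict;
use warnings;

# Hypothetical variant of replaceString: $replace is a code reference
# that builds the replacement text after each match.
sub replaceString {
    my ($text, $search, $replace) = @_;
    $text =~ s/$search/$replace->()/ige;   # single /e: just call the sub
    return $text;
}

# The anonymous sub is called during the substitution, so it sees the
# capture variables of the match in progress.
my $out = replaceString("  045:10", qr/\s+0(\d\d:)/, sub { $1 });
print "[$out]\n";   # prints [45:10]
```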
posted by hattifattener at 3:28 PM on September 17, 2010


Everyone's being very polite, and I don't want to spoil the mood, but you asked this question in a very combative, negative way.

Another thing to bear in mind: try perldoc before hammering google to find out about perl obstinacy.
posted by AmbroseChapel at 5:07 PM on September 17, 2010


Just to chime in: it sounds like you're using neither strict nor warnings.

make sure your script starts like this:


#!/usr/bin/env perl
use warnings;
use strict;

# do stuff here ...


Doing so will eliminate much of your pain, and allow you to ask your question more clearly. Lots of good perl answers, and excellent place to ask questions here
posted by singingfish at 5:32 PM on September 17, 2010


Response by poster: Well, damn. ee it is. That finally makes it work - which is generating more headaches for me (because other parts of the script are giving unintended results) but it is finally replacing things as I expected it to do. I might get this finished this week after all.

AmbroseChapel - the question was written in frustration. I'm combative and negative because I have to do this in the first place - our data collection setup returns a text file that used to take a metric ton of hand copy-pasting just to get the raw data in a usable form. Scripting this is my attempt to make up for what the program authors should have done in the first place for software and systems that cost tens of thousands of dollars. I'm also pretty annoyed that I couldn't figure it out on my own to begin with, despite spending a lot of time searching (and yes, a lot of that searching was in the Perl documentation).

singingfish - that was a code snippet. The strict and warning flags are set in the file itself. No warnings were generated from the script though as it did what it was written to do - replaced a search result with "\1". Once I added the ee flag it works as I expected, and is now generating warnings about the $replace variable - hence my new headaches. But I think I'm on the right track.

Thanks, all!
posted by caution live frogs at 6:38 PM on September 17, 2010


> the question was written in frustration. I'm combative and negative because I have to do this in the first place

It might have been better, then, to ask a more general question about the task or software instead. This might well turn out to be an X-Y problem, or have a solution in a CPAN module. You're reformatting dates and times by means of raw string manipulation, when some module might be way better.

Can you give us an overview? What's the big picture?
posted by AmbroseChapel at 8:36 PM on September 17, 2010

