Regex question
January 21, 2007 11:52 AM

What's a regular expression to match and replace space characters between curly braces with underscores?

In other words, turn something like:

blah blah blah {blah blah blah} blah

into:

blah blah blah blah_blah_blah blah

There can be any number of words between the curly braces.
posted by pealco to Computers & Internet (20 answers total)
 
I could be wrong, but I don't think you can do that with a single regular expression. You'll have to first find the brace-enclosed sections and then run them through a second regex.

You'll probably be able to do this in a single line if you're able to process the search results with a bit of code (with the 'e' flag in Perl, for example).
posted by Khalad at 12:09 PM on January 21, 2007
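
Khalad's two-step idea (find each brace-enclosed section, then transform its contents with a bit of code) can be sketched in Python, where `re.sub` accepts a function as the replacement, much like Perl's `e` flag. This is only an illustration: the function name `underscore_braced` is made up here, and the sketch assumes braces are balanced and not nested.

```python
import re


def underscore_braced(text):
    """Replace spaces inside {...} with underscores and drop the braces.

    Assumes braces are balanced and not nested.
    """
    def join_words(match):
        # Step 2: transform only the text captured between the braces.
        return match.group(1).replace(" ", "_")

    # Step 1: find each brace-enclosed section and hand it to join_words.
    return re.sub(r"\{([^}]*)\}", join_words, text)


print(underscore_braced("blah blah blah {blah blah blah} blah"))
```

Keeping the space replacement in a callback means the outer pattern only has to locate the braces, which is why this reads more simply than a single clever regex.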


in perl this code snippet should print out what you want: (usual disclaimers apply: untested, off the top of my head, etc...)

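$string = "blah blah blah {blah blah blah} blah";

# capture the text before, inside, and after the braces
if ($string =~ /(.+)\{(.+)\}(.+)/) {
    $prev = $1;
    $within_braces = $2;
    $after = $3;
}
@temp = split(" ", $within_braces);

print $prev;

# print every word but the last with a trailing underscore
for ($i = 0; $i < (@temp - 1); $i++) {
    print $temp[$i] . "_";
}
print $temp[(@temp - 1)];
print $after;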
posted by chrisamiller at 12:30 PM on January 21, 2007


Yeah, I don't know if it can be done in a single RegEx. Here's relatively simple PHP code to do it:

function a( $b ) { return str_replace( ' ' , '_' , $b[1] ); }
$out = preg_replace_callback( '/{([^}]*)}/' , 'a' , $in );
posted by scottreynen at 12:32 PM on January 21, 2007


This might work in perl or with PCRE's (like php's preg_foo functions):

$str =~ s/(\{[^} ]+)( )/$1_/g;

The idea in english: capture an opening brace followed by any number of (but at least one) non-brace/non-space characters, and capture a space immediately following it. Replace this string with the first capture followed by an underscore. Repeat.

However, I'm not sure perl does subsequent search/replace operations on the iteratively changed string when you specify the global flag at the end. So you might have to do this:

while ($str =~ s/(\{[^} ]+)( )/$1_/) {}

The regex should also work with PCRE (preg_replace) in PHP in a similar way.

The callbacks mentioned earlier in the thread are probably a cleaner way of doing things, tho'.
posted by weston at 12:39 PM on January 21, 2007
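
The repeat-until-nothing-changes idea weston describes might look like this in Python. It is a rough sketch, not a translation of the Perl: `underscore_by_looping` is a hypothetical name, braces are assumed to be balanced and not nested, and the final brace-stripping step is an addition, since the looping pattern itself leaves the braces in place.

```python
import re


def underscore_by_looping(text):
    """Turn spaces inside {...} into underscores, one space per pass."""
    # An opening brace, a run of non-brace/non-space characters, then a space.
    pattern = re.compile(r"(\{[^} ]+) ")
    while True:
        # Each pass converts at most one space in every brace group.
        text, count = pattern.subn(r"\1_", text)
        if count == 0:
            break
    # The loop keeps the braces, so remove them in a separate step.
    return re.sub(r"\{([^}]*)\}", r"\1", text)


print(underscore_by_looping("blah blah blah {blah blah blah} blah"))
```

The loop runs once per space in the longest brace group, plus one final pass that finds nothing, so it does more work than a callback but needs no code in the replacement.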


However, I'm not sure perl does subsequent search/replace operations on the iteratively changed string when you specify the global flag at the end.

It does not. It would be nice if it did sometimes, but it would be too easy to get stuck in an infinite loop.
posted by Khalad at 1:09 PM on January 21, 2007
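
Python's `re.sub` behaves the same way in this respect, so it makes a convenient place to see the effect. In this small demonstration (the sample string is made up), one global pass converts only the first space in the brace group, because scanning resumes after the replaced text instead of re-reading it.

```python
import re

text = "x {a b c} y"

# One global pass: scanning resumes after each match, so text produced
# by a replacement is never examined again.
once = re.sub(r"(\{[^} ]+) ", r"\1_", text)

print(once)  # the second space inside the braces is untouched
```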


There are some things that regular expressions are really good at, but anything involving balanced matching is usually not one of those things.
posted by Rhomboid at 1:24 PM on January 21, 2007


awkfile:
BEGIN { regex = "{[^{]*}" }
match($0, regex) {
  substring1 = substr($0, RSTART, RLENGTH);
  gsub(/\ /, "_", substring1);
  gsub(/{|}/, "", substring1);
  sub(regex, substring1, $0);
  print $0;
}
Then:
awk -f awkfile inputfile
posted by ctmf at 1:34 PM on January 21, 2007


rats. replace line 1 with BEGIN { regex = "{[^}]*}" }
I reversed a bracket.
problems: eats lines that don't match, won't work with nested brackets. Otherwise works for me.
posted by ctmf at 1:39 PM on January 21, 2007


One more edit to the awk script - complete script:
BEGIN { regex = "{[^{}]*}" }
{
if (!(match($0, regex))) {
print $0;
}
}
match($0, regex) {
substring1 = substr($0, RSTART, RLENGTH);
gsub(/\ /, "_", substring1);
gsub(/{|}/, "", substring1);
sub(regex, substring1, $0);
print $0;
}

Now it doesn't eat non-matching lines, and gracefully handles nested brackets. However, if you have nested brackets, the script only handles the inner set. You have to run the script again on the output stream as many times as you had nesting levels.
posted by ctmf at 1:50 PM on January 21, 2007


Ok, last edit. Complete script:
BEGIN { regex = "{[^{}]*}" }
{
if (!(match($0, regex))) {
print $0;
next;
}
}
{
while (match($0, regex)) {
substring1 = substr($0, RSTART, RLENGTH);
gsub(/\ /, "_", substring1);
gsub(/{|}/, "", substring1);
sub(regex, substring1, $0);
}
print $0;
}

Reruns the substitution on nested bracket levels by itself. You can still break it by unbalancing your brackets, but, well, don't do that.
Thanks for the fun exercise.
posted by ctmf at 2:06 PM on January 21, 2007


You could do this with a regexp in a loop:
perl: while (s/({[^}]*) /$1_/g) {};

On each pass, this replaces one space in every {} section. Because [^}]* is greedy, it is the last remaining space in the section. On the next pass that space is gone, so the one before it is now the last space and is replaced. And so on.
posted by aneel at 2:50 PM on January 21, 2007


Heh. Or basically "what weston said". Oops.
posted by aneel at 2:53 PM on January 21, 2007


Maybe I'm reading weston and aneel's perl wrong, but:
Where do the braces go? Wouldn't you get:

one two {three four five} six seven
to
one two {three_four_five} six seven

and nesting, which may not be an issue, depending on pealco's input file:

one two {three {four five} six} seven
to
one two {three_{four_five} six} seven?

I don't know enough about perl to try it, but it looks wrong to me. Obviously, I could be wrong.
posted by ctmf at 3:29 PM on January 21, 2007


#!/usr/bin/perl
# Usage: blah.pl blah.txt

sub foo {
        $t = shift;
        $t =~ s/\s/_/g;
        return $t;
}
while ( $line = <> ) {
        $line =~ s/\{([^\}]+)\}/foo($1)/ge;
        print $line;
}       
$ cat blah.txt
blah blah blah {blah blah blah} blah
blah {blah blah} blah {blah blah blah} blah
{bonk bonk} bonk {on the head} {bonk bonk}
$ ./blah.pl blah.txt
blah blah blah blah_blah_blah blah
blah blah_blah blah blah_blah_blah blah
bonk_bonk bonk on_the_head bonk_bonk

If you need to process nested braces then you'll need to write a parser, or a very hairy regex.
posted by zengargoyle at 4:19 PM on January 21, 2007
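
For the nested case zengargoyle mentions, the "parser" can be as small as a character scan that tracks brace depth. The following Python sketch is one way to do it, not the only one: the name `underscore_nested` is invented for this example, and raising `ValueError` on unbalanced braces is a design choice made here.

```python
def underscore_nested(text):
    """Underscore spaces at any brace depth and drop the braces."""
    depth = 0
    out = []
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError("closing brace without an opening brace")
            depth -= 1
        elif ch == " " and depth > 0:
            # Inside at least one pair of braces: space becomes underscore.
            out.append("_")
        else:
            out.append(ch)
    if depth != 0:
        raise ValueError("opening brace without a closing brace")
    return "".join(out)


print(underscore_nested("one two {three {four five} six} seven"))
```

A single pass over the input handles any nesting depth, and unbalanced input is reported instead of being silently mangled.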


That's just a long way of writing perl -pe 's,{([^}]+)},($t = $1) =~ s/\s/_/g; $t,ge' blah.txt
posted by Rhomboid at 4:45 PM on January 21, 2007


Ok, adding the while loop to the awk script made the first form redundant. So now the complete script is:
BEGIN { regex = "{[^{}]*}" }
{
while (match($0, regex)) {
substring1 = substr($0, RSTART, RLENGTH);
gsub(/\ /, "_", substring1);
gsub(/{|}/, "", substring1);
sub(regex, substring1, $0);
}
print $0;
}

I swear I will stop looking at this now.
posted by ctmf at 4:57 PM on January 21, 2007
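
The same innermost-first loop that ctmf's final awk script uses can be sketched in Python for comparison. The names `INNERMOST` and `flatten_nested` are made up for this illustration. As with the awk version, a brace without a partner is simply left in the output.

```python
import re

# A brace pair with no braces inside it, i.e. an innermost group.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def flatten_nested(text):
    """Repeatedly resolve innermost {...} groups until none remain."""
    while True:
        text, count = INNERMOST.subn(
            lambda m: m.group(1).replace(" ", "_"), text
        )
        if count == 0:
            return text


print(flatten_nested("one two {three {four five} six} seven"))
```

Each pass peels off one nesting level, so the loop runs once per level plus a final pass that finds nothing.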


You can do this by adding the "execute" flag to have the right-hand-side processed as code:
$str = 'blah blah blah {blah blah blah} blah';$str =~ s!\{([^}]+)\}!$x=$1;$x=~s| |_|g;$x!ge;print $str;
doesn't do nesting, and isn't technically "one regex" but it's one line, which has to count for something.

For those people about to make line-noise jokes, I can't do anything about the fact that the pattern includes curlies, but this ought to help:
$str = 'blah blah blah {blah blah blah} blah';
$str =~ s/
            \{([^}]+)\} # replace:
                        # opening curly bracket, string of at least
                        # one non-curly, closing curly bracket
        /
                        # inline code does space-to-underscore
                        # replacements on the match:
            $x=$1;
            $x=~s| |_|g;
            $x;         # last line of inline code is $x, so that's the output
        /xge;           # x allows the comments and spaces, g is global
                        # and e is "execute RHS";

posted by AmbroseChapel at 5:09 PM on January 21, 2007


Response by poster: Wow, thanks everybody.

FWIW, the input file won't have any nesting. Also, the choice of curly brackets was arbitrary, it could be anything if it makes it easier.

This was supposed to be a quick time-saver. The input file requires underscores, but placing underscores between words is a little harder than placing brackets around the words.
posted by pealco at 6:05 PM on January 21, 2007
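
Since pealco says the delimiters are arbitrary and the input is never nested, the markers could be passed in as parameters. Here is a hedged Python sketch: `underscore_between` and its parameter names are hypothetical, and `re.escape` is used so that markers such as `[` and `]`, which are special in a regex, still work.

```python
import re


def underscore_between(text, open_delim="{", close_delim="}"):
    """Underscore spaces between the given markers and drop the markers.

    Assumes the markers are balanced and never nested.
    """
    # re.escape makes markers like "[" or "(" safe to embed in a pattern.
    pattern = re.escape(open_delim) + r"(.*?)" + re.escape(close_delim)
    return re.sub(pattern, lambda m: m.group(1).replace(" ", "_"), text)


print(underscore_between("blah [blah blah blah] blah", "[", "]"))
```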


Maybe I'm reading weston and aneel's perl wrong, but: Where do the braces go?

No, you're not reading it wrong, I was reading the question wrong. The constraint that the braces need to disappear makes the problem quite a bit harder. At the moment, I'm not seeing an elegant way to handle it using just the looping idea. This works (and handles properly paired nested braces fine):
$ perl -pe 'while (s/{([^}]*) ([^}]*)}|{([^ }]*)}/$3 || "{$1_$2}"/ge) {};'
blah {blah blah blah} blah blah {blah {blah blah} blah} blah blah
blah blah_blah_blah blah blah blah_blah_blah_blah blah blah
But it's cheating by using /e to treat one regexp as two. Rhomboid's solution (which is the same as AmbroseChapel's, unless I'm mistaken*) is definitely more elegant and comprehensible.

* and considering my track record so far in this thread...
posted by aneel at 6:12 PM on January 21, 2007


Wow, nice work.
posted by ctmf at 10:45 PM on January 21, 2007

