Regex question
January 21, 2007 11:52 AM Subscribe
What's a regular expression to match and replace space characters between curly braces with underscores?
In other words, turn something like:
blah blah blah {blah blah blah} blah
into:
blah blah blah blah_blah_blah blah
There can be any number of words between the curly braces.
in perl this code snippet should print out what you want: (usual disclaimers apply: untested, off the top of my head, etc...)
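$string = "blah blah blah {blah blah blah} blah";
# capture the text before, inside, and after the braces
if ($string =~ /(.+)\{(.+)\}(.+)/) {
    $prev = $1;
    $within_braces = $2;
    $after = $3;
}
@temp = split(" ", $within_braces);
print $prev;
# print every word but the last with a trailing underscore
for ($i = 0; $i < (@temp - 1); $i++) {
    print $temp[$i] . "_";
}
print $temp[(@temp - 1)];
print $after;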
posted by chrisamiller at 12:30 PM on January 21, 2007
Yeah, I don't know if it can be done in a single RegEx. Here's relatively simple PHP code to do it:
function a( $b ) { return str_replace( ' ' , '_' , $b[1] ); }
$out = preg_replace_callback( '/{([^}]*)}/' , 'a' , $in );
posted by scottreynen at 12:32 PM on January 21, 2007
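Editor's sketch, not part of the thread: the same callback idea can be written with Python's `re.sub`, which accepts a function as the replacement. The function name `underscore_braced` is made up for illustration, and the sketch assumes braces are not nested.

```python
import re

def underscore_braced(text):
    """Replace each {...} group with its contents, spaces turned into underscores."""
    def join_words(match):
        # group(1) is the text between the braces; the braces themselves are dropped
        return match.group(1).replace(" ", "_")
    return re.sub(r"\{([^}]*)\}", join_words, text)

print(underscore_braced("blah blah blah {blah blah blah} blah"))
# prints: blah blah blah blah_blah_blah blah
```

The regex only locates the braced span; the space-to-underscore step is ordinary string code in the callback.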
This might work in perl or with PCRE's (like php's preg_foo functions):
$str =~ s/(\{[^} ]+)( )/$1_/g
The idea in English: capture an opening brace followed by any number of (but at least one) non-brace/non-space characters, and capture a space immediately following it. Replace this string with the first capture followed by an underscore. Repeat.
However, I'm not sure Perl does subsequent search/replace operations on the iteratively changed string when you specify the global flag at the end. So you might have to do this:
while($str =~ s/(\{[^} ]+)( )/$1_/) {}
The regex should also work with PCRE (preg_replace) in PHP in a similar way.
The callbacks mentioned earlier in the thread are probably a cleaner way of doing things, tho'.
posted by weston at 12:39 PM on January 21, 2007
However, I'm not sure perl does subsequent search/replace operations on the iteratively changed string when you specify the global flag at the end.
It does not. It would be nice if it did sometimes, but it would be too easy to get stuck in an infinite loop.
posted by Khalad at 1:09 PM on January 21, 2007
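Editor's sketch, not part of the thread: Python's `re.sub` behaves the same way as Perl's global flag here, so it can illustrate the point. A single global pass resumes scanning after each match and never revisits replaced text, so only the first space after each opening brace changes; repeating the pass until nothing is replaced catches the rest. The names `one_pass` and `until_stable` are invented for this example, and the braces are deliberately left in place, as in the regex quoted above.

```python
import re

# An opening brace, a run of non-brace/non-space characters, then one space.
PATTERN = re.compile(r"(\{[^} ]+) ")

def one_pass(text):
    # One global substitution: scanning continues after each match,
    # so the text produced by a replacement is not searched again.
    return PATTERN.sub(r"\1_", text)

def until_stable(text):
    # Repeat the global substitution until a pass makes no replacements.
    while True:
        text, count = PATTERN.subn(r"\1_", text)
        if count == 0:
            return text

sample = "blah blah blah {blah blah blah} blah"
print(one_pass(sample))      # blah blah blah {blah_blah blah} blah
print(until_stable(sample))  # blah blah blah {blah_blah_blah} blah
```

The loop always ends because every successful pass removes at least one space from inside the braces.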
There are some things that regular expressions are really good at, but anything involving balanced matching is usually not one of those things.
posted by Rhomboid at 1:24 PM on January 21, 2007
awkfile:
BEGIN { regex = "{[^{]*}" }
match($0, regex) {
substring1 = substr($0, RSTART, RLENGTH);
gsub(/\ /, "_", substring1);
gsub(/{|}/, "", substring1);
sub(regex, substring1, $0);
print $0;
}
Then:
awk -f awkfile inputfile
posted by ctmf at 1:34 PM on January 21, 2007
rats. replace line 1 with BEGIN { regex = "{[^}]*}" }
I reversed a bracket.
problems: eats lines that don't match, won't work with nested brackets. Otherwise works for me.
posted by ctmf at 1:39 PM on January 21, 2007
One more edit to the awk script - complete script:
BEGIN { regex = "{[^{}]*}" }
{
if (!(match($0, regex))) {
print $0;
}
}
match($0, regex) {
substring1 = substr($0, RSTART, RLENGTH);
gsub(/\ /, "_", substring1);
gsub(/{|}/, "", substring1);
sub(regex, substring1, $0);
print $0;
}
Now it doesn't eat non-matching lines, and gracefully handles nested brackets. However, if you have nested brackets, the script only handles the inner set. You have to run the script again on the output stream as many times as you had nesting levels.
posted by ctmf at 1:50 PM on January 21, 2007
Ok, last edit. Complete script:
BEGIN { regex = "{[^{}]*}" }
{
if (!(match($0, regex))) {
print $0;
next;
}
}
{
while (match($0, regex)) {
substring1 = substr($0, RSTART, RLENGTH);
gsub(/\ /, "_", substring1);
gsub(/{|}/, "", substring1);
sub(regex, substring1, $0);
}
print $0;
}
Reruns the substitution on nested bracket levels by itself. You can still break it by unbalancing your brackets, but, well, don't do that.
Thanks for the fun exercise.
posted by ctmf at 2:06 PM on January 21, 2007
You could do this with a regexp in a loop:
perl:
while (s/({[^}]*) /$1_/g) {};
On the first pass, this replaces the first space in all {} sections. On the second pass, the first space is no longer a space, so the former-second space is now the first space and is replaced. And so on.
posted by aneel at 2:50 PM on January 21, 2007
Maybe I'm reading weston and aneel's perl wrong, but:
Where do the braces go? Wouldn't you get:
one two {three four five} six seven
to
one two {three_four_five} six seven
and nesting, which may not be an issue, depending on pealco's input file:
one two {three {four five} six} seven
to
one two {three_{four_five} six} seven?
I don't know enough about perl to try it, but it looks wrong to me. Obviously, I could be wrong.
posted by ctmf at 3:29 PM on January 21, 2007
#!/usr/bin/perl
# Usage: blah.pl blah.txt
sub foo {
$t = shift;
$t =~ s/\s/_/g;
return $t;
}
while ( $line = <> ) {
$line =~ s/\{([^\}]+)\}/foo($1)/ge;
print $line;
}
$ cat blah.txt
blah blah blah {blah blah blah} blah
blah {blah blah} blah {blah blah blah} blah
{bonk bonk} bonk {on the head} {bonk bonk}
$ ./blah.pl blah.txt
blah blah blah blah_blah_blah blah
blah blah_blah blah blah_blah_blah blah
bonk_bonk bonk on_the_head bonk_bonk
If you need to process nested braces then you'll need to write a parser, or a very hairy regex.
posted by zengargoyle at 4:19 PM on January 21, 2007
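Editor's sketch, not part of the thread: for this particular task the "parser" for nested braces can be very small, because all it has to remember is how deep inside braces the current character is. The function name `underscore_nested` is hypothetical. The sketch assumes the braces are balanced; a stray closing brace outside any group is simply kept.

```python
def underscore_nested(text):
    """Drop braces and turn spaces into underscores at any nesting depth."""
    out = []
    depth = 0  # number of currently open braces
    for ch in text:
        if ch == "{":
            depth += 1          # enter a group; the brace is not emitted
        elif ch == "}" and depth > 0:
            depth -= 1          # leave a group; the brace is not emitted
        elif ch == " " and depth > 0:
            out.append("_")     # a space inside braces becomes an underscore
        else:
            out.append(ch)      # everything else is copied unchanged
    return "".join(out)

print(underscore_nested("one two {three {four five} six} seven"))
# prints: one two three_four_five_six seven
```

A single left-to-right pass handles every nesting level, so nothing needs to be rerun on the output.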
That's just a long way of writing perl -pe 's,{([^}]+)},($t = $1) =~ s/\s/_/g; $t,ge' blah.txt
posted by Rhomboid at 4:45 PM on January 21, 2007
Ok, adding the while loop to the awk script made the first form redundant. So now the complete script is:
BEGIN { regex = "{[^{}]*}" }
{
while (match($0, regex)) {
substring1 = substr($0, RSTART, RLENGTH);
gsub(/\ /, "_", substring1);
gsub(/{|}/, "", substring1);
sub(regex, substring1, $0);
}
print $0;
}
I swear I will stop looking at this now.
posted by ctmf at 4:57 PM on January 21, 2007
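Editor's sketch, not part of the thread: the innermost-first loop used by the awk script can be expressed in Python with `re.subn`, which returns both the new string and the number of replacements made. The pattern matches only brace pairs that contain no other braces, so each pass peels off one nesting level. The name `flatten_braces` is invented for this example, and unbalanced braces are left untouched.

```python
import re

# A brace pair with no braces inside it, i.e. an innermost group.
INNERMOST = re.compile(r"\{([^{}]*)\}")

def flatten_braces(text):
    while True:
        # Replace every innermost group with its contents, spaces underscored.
        text, count = INNERMOST.subn(lambda m: m.group(1).replace(" ", "_"), text)
        if count == 0:
            # No complete brace pair is left, so the text is stable.
            return text

print(flatten_braces("one two {three {four five} six} seven"))
# prints: one two three_four_five_six seven
```

Each pass removes at least one pair of braces, so the loop runs at most once per nesting level plus a final pass that finds nothing.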
You can do this by adding the "execute" flag to have the right-hand-side processed as code:
$str = 'blah blah blah {blah blah blah} blah';
$str =~ s!\{([^}]+)\}!$x=$1;$x=~s| |_|g;$x!ge;
print $str;
doesn't do nesting, and isn't technically "one regex" but it's one line, which has to count for something.
For those people about to make line-noise jokes, I can't do anything about the fact that the pattern includes curlies, but this ought to help:
$str = 'blah blah blah {blah blah blah} blah';
$str =~ s/
    \{([^}]+)\}    # replace:
                   # opening curly bracket, string of at least
                   # one non-curly, closing curly bracket
/
    # inline code does space-to-underscore
    # replacements on the match:
    $x=$1;
    $x=~s| |_|g;
    $x;            # last line of inline code is $x, so that's the output
/xge;              # x allows the comments and spaces, g is global
                   # and e is "execute RHS";
posted by AmbroseChapel at 5:09 PM on January 21, 2007
Response by poster: Wow, thanks everybody.
FWIW, the input file won't have any nesting. Also, the choice of curly brackets was arbitrary, it could be anything if it makes it easier.
This was supposed to be a quick time-saver. The input file requires underscores, but placing underscores between words is a little harder than placing brackets around the words.
posted by pealco at 6:05 PM on January 21, 2007
Maybe I'm reading weston and aneel's perl wrong, but: Where do the braces go?
No, you're not reading it wrong, I was reading the question wrong. The constraint that the braces need to disappear makes the problem quite a bit harder. At the moment, I'm not seeing an elegant way to handle it using just the looping idea. This works (and handles properly paired nested braces fine):
$ perl -pe 'while (s/{([^}]*) ([^}]*)}|{([^ }]*)}/$3 || "{$1_$2}"/ge) {};'
blah {blah blah blah} blah blah {blah {blah blah} blah} blah blah
blah blah_blah_blah blah blah blah_blah_blah_blah blah blah
But it's cheating by using /e to treat one regexp as two. Rhomboid's solution (which is the same as AmbroseChapel's, unless I'm mistaken*) is definitely more elegant and comprehensible.
* and considering my track record so far in this thread...
posted by aneel at 6:12 PM on January 21, 2007
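Editor's sketch, not part of the thread: a Python rendering of the two-alternative trick, to make the control flow easier to follow. The first alternative underscores one space inside a group and keeps the braces; the second fires only when a group has no spaces left and strips its braces. The name `two_in_one` is made up here, and the `is not None` test is this sketch's own choice so that an empty `{}` group is removed cleanly.

```python
import re

# Alternative 1: a group that still contains a space (groups 1 and 2).
# Alternative 2: a group with no spaces left (group 3).
TWO_IN_ONE = re.compile(r"\{([^}]*) ([^}]*)\}|\{([^ }]*)\}")

def _step(match):
    if match.group(3) is not None:
        # No spaces remain in this group: drop the braces.
        return match.group(3)
    # Underscore one space and keep the braces for the next pass.
    return "{%s_%s}" % (match.group(1), match.group(2))

def two_in_one(text):
    while True:
        text, count = TWO_IN_ONE.subn(_step, text)
        if count == 0:
            return text

print(two_in_one("blah {blah {blah blah} blah} blah"))
# prints: blah blah_blah_blah_blah blah
```

Every replacement removes either one space or one pair of braces, so the loop terminates on any input.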
You'll probably be able to do this in a single line if you're able to process the search results with a bit of code (with the 'e' flag in Perl, for example).
posted by Khalad at 12:09 PM on January 21, 2007