Filtering groups of records using awk?
April 23, 2006 2:50 AM   Subscribe

How to remove groups of records from a text file using awk or similar in Unix based on a flag at the end of the record-groups?

I have text files containing a number of ordered groups of records, where each group comprises a single record beginning with the characters '10', followed by an arbitrary number of records beginning '20', '30', '40', etc., and ending with a single record beginning with the characters '99'. Some of the '99' records are tagged ('99|B|') to show that the block as a whole is 'bad' and should be discarded (at the end of 'good' blocks the record is '99||')—what’s a good way to do this using a tool such as awk?

I have only ever attempted simple awk one-liners before, and, while I could probably figure this out given enough time, I would really appreciate some hand-holding on this particular problem. I won’t be able to use perl, python or whatnot in this case. Thank you!
posted by misteraitch to Computers & Internet (27 answers total)
 
How are your records delimited? Can you post a portion of the data (or a made-up example)?
posted by Mikey-San at 3:10 AM on April 23, 2006


The records are newline-delimited, and each record is a variable-length pipe-character-delimited string. The only sample I have has some confidential information in it, so here's a simplified approximation:

10|Account1|Name|Address|
20|Amount1|Amount2|
30|Message|
40|Product1|Amount|
50|Detail Line 1|
40|Product2|Amount|
50|Detail Line 1|
50|Detail Line 2|
99||
10|Account2|Name|Address|
20|Amount1|Amount2|
30|Message1|
30|Message2|
40|Product1|Amount|
50|Detail Line 1|
50|Detail Line 2|
50|Detail Line 3|
40|Product2|Amount|
50|Detail Line 1|
50|Detail Line 2|
40|Product3|Amount|
50|Detail Line 1|
99|B|
10|Account3|Name|Address|
20|Amount1|Amount2|
30|Message|
40|Product|Amount|
99||
posted by misteraitch at 3:29 AM on April 23, 2006


So, with this example, the records from:

10|Account2|Name|Address|

to:

99|B|

are collectively 'bad' and I need to discard them.
posted by misteraitch at 3:31 AM on April 23, 2006


Why won't you be able to use Perl in this case? You know it can be invoked with code on the command-line and that it's available for Win32 too. If you're planning on awk, why is Perl verboten?

Anyway, you'll need to write a program that remembers some state, ie the position of recent 99|| records and which ones have |B| in them. Simple pipes are stateless and will not suffice for this.

I'm not an awk hacker at all, it might manage it but I don't think so. Could be wrong.
posted by polyglot at 3:44 AM on April 23, 2006


Yes, why not Perl?

But as I'm an accomplished programmer in pseudocode, the logic obviously goes something like this:

when you see a "10":
start capturing each line to a variable
when you get to a 99:
if it's followed by a B,
print the variable to your other file,
dump the variable.
if it's not,
dump the variable.


lather, rinse, repeat.
posted by AmbroseChapel at 3:51 AM on April 23, 2006


Pipe it trough

b=; while read -r a; do b="$b\n$a"; if echo $b | grep -q '^99|'; then if ! echo $b | grep -q '^99|B|'; then echo -n $b; fi; b=; fi; done

This will probably go wrong if your records contain binary data and I'd check to make sure it works on backslashes and the like. Also, you get an extra new line as the first line of your output, if that's a problem it's easy to remove but it makes the oneliners slightly longer.
posted by fvw at 3:58 AM on April 23, 2006


Just for the record, this is the simplest I can make it in The Language Which Dare Not Speak Its Name:

perl -ne 'if(/^99\|\|/){print $x;$x=""}elsif(/^99\|B\|/){$x=""}else{$x.=$_}' file.txt
posted by AmbroseChapel at 4:27 AM on April 23, 2006


This script works in GNU awk, but probably not in other versions (I don't think the RT variable is a standard feature).

BEGIN { RS = "99\\|B?\\|\n"; ORS = "" }
RT == "99||\n" { print $0 RT }
posted by reynaert at 4:37 AM on April 23, 2006


You'll want to make that print $x,$_.
posted by fvw at 4:38 AM on April 23, 2006


My previous comment referred to Ambrose's code by the way.


Though I'm aware this isn't helping the asker at all, I can't resist a short perl challenge, so here's mine:

perl -ne '$f.=$_;/^99.B/ and$f="";/^99/&&print$f'

You can leave away the space before the and, but it's a miracle the parser doesn't barf on that.
posted by fvw at 4:47 AM on April 23, 2006


Thanks for pointing that out of course you're right, fv, my golf buddy...
posted by AmbroseChapel at 4:57 AM on April 23, 2006


I don't know awk that well, but it can be done with sed. I don't know sed that well either, but this seems to work:

sed -n ' :start; s/99|B|/a/; t bad; s/99||/99||/; t print; H; n; b start;:print; H; x; p; :bad; s/.*//; x;' < input
posted by sfenders at 5:35 AM on April 23, 2006


That poor bit of sed needs some improvement.

sed -n ':A;/^99||/{H;x;s/^\n//;p};/^99/{s/.*//;x;n;bA};H;n;bA'
posted by sfenders at 9:52 AM on April 23, 2006


Now that the question has been answered, I'd really like to know what kind of odd reason there would be that awk is acceptable and perl is not. I guess if you were working in an embedded environment or something, memory constraints might come into play.
posted by Rhomboid at 1:25 PM on April 23, 2006


Address ranges work if you process the file backwards:

tac | sed '/^99|B|/,/^10|/d' | tac
posted by flabdablet at 5:52 PM on April 23, 2006


Ooh, I never knew tac existed, thanks flabdablet. (and you win on shortest too)
posted by fvw at 6:04 PM on April 23, 2006


This is shorter. More efficient, too, since it doesn't need the two reversals.

sed -n 'H;/^10/h;/^99||/{g;p}'

It ignores 99|B| markers altogether, simply gathering input lines into the hold space, restarting the collection every time it sees a line that starts with 10|, and printing them all out when it sees a line that starts with 99||.

I've tested this against the data you provided using gnu sed (whose binary is less than a third of the size of gawk's), and it works. Beat that, perl-heads :)
posted by flabdablet at 8:36 PM on April 23, 2006


<goes off to learn sed>
posted by AmbroseChapel at 9:37 PM on April 23, 2006


Thanks everyone: it turns out that I assumed wrong wrt perl—I understood it hadn't been installed on the server this would run on—but it turns out that it has. It's a Compaq Alpha True-64 box, by the way. Alas, there's no tac or gawk on it…
posted by misteraitch at 11:29 PM on April 23, 2006


Given flabdablet's second approach I can whittle it down to
perl -pe '$f=/^10/?$_:$f.$_;$_=/^99..$/?$f:""'
but I'm afraid sed is going to win this one, even the single-big-regexp approach isn't going to work here due to it's setup overhead.
posted by fvw at 2:43 AM on April 24, 2006


Surely it has sed, though. I've yet to see a nixbox without sed.

But even if it doesn't have sed, awk, grep or perl: as long as it has bash or even plain old sh, you're ok. Pipe or redirect your data through this:

while read -r p
do
    h="$h"$'\n'"$p"
    [[ "${p:0:3}" = '10|' ]] && h="$p"
    [[ "${p:0:4}" = '99||' ]] && echo "$h"
done


The logic is identical to the sed script I posted earlier, using only sh builtin commands. I tested it with your supplied data and it works.

But you will have sed. Everyone has sed.
posted by flabdablet at 3:12 AM on April 24, 2006


flabdablet—yes, it has sed and regular awk and nawk and perl. I can't get the sed commands as above to run on it for whatever reason (probably plain user-stupidity on my part): I'll try the latest one above & see how that is. fvw's script works but is a tad slow given the real files. I must have over-simplified with my example above, as a couple of the perl solutions don't work with the real data. My thanks again to everyone: I guess I really need to get myself sent away to scripting boot-camp…
posted by misteraitch at 4:14 AM on April 24, 2006


When sed fails, how does it fail?
posted by flabdablet at 4:33 AM on April 24, 2006


with:

sed: Function H;/^10/h;/^99||/{g;p} cannot be parsed.
posted by misteraitch at 4:48 AM on April 24, 2006


And your latest scriptlet works fine & is plenty fast enough, by the way.
posted by misteraitch at 4:50 AM on April 24, 2006


Just checking the Tru64 manpage for sed, it looks like that version requires newlines where gnu sed allows semicolons. Try entering that script fragment as

sed -n $'H\n/^10/h\n/^99||/{\ng\np\n}'

That will clearly work, because it now looks completely like modem line noise (sed is one of the earliest designs for a useful write-only language).

The following is equivalent, but might look even worse when buried in your overall script.

sed -n 'H
/^10/h
/^99||/{
g
p
}'

posted by flabdablet at 4:59 AM on April 24, 2006


Sorry. /^10/h should be /^10|/h so it still works if there are data lines starting with 10 not immediately followed by | - probably good practice even if this isn't supposed to be possible.
posted by flabdablet at 5:13 AM on April 24, 2006


« Older Why Borrow?   |   How is American Idol's Phone System Configured? Newer »
This thread is closed to new comments.