Convert text with whitespace to csv
March 20, 2012 12:32 PM   Subscribe

perl or sed or awk help for newbie

I am trying to get my tax preparation done and so I need to get info into Moneydance. One of my credit card accounts does not give ofx downloads, just pdfs of the monthly account summaries, and I have managed to extract the text from the pdfs and I now need to transform the text into csv format to import it. I have roughly a thousand transactions to process. I have MacOS 10.7 and can use the terminal but no regex skills.

The text is currently formatted as

MMM DD MMM DD PAYEE INFORMATION CITY PROVINCE
TRANSACTIONNUMBER COMMENT
$AMOUNT

and here is an example of two records

DEC 19 DEC 21 RADIO PARADISE 530¯872¯4993 CA
85450930354980007961385 Foreign Currency¯USD 20.00 Exchange rate¯1.046500
$20.93
DEC 23 DEC 24 SUNTERRA LENDRUM MA CALGARY AB
55181360357461606795546
$101.45

How can I transform this to a csv file such as

MMM DD, MMM DD, PAYEE INFORMATION CITY PROVINCE, TRANSACTIONNUMBER COMMENT, AMOUNT

which I can import into my financial software?
posted by v-tach to Computers & Internet (26 answers total) 4 users marked this as a favorite
 
Does something separate each record in the big file? Or are they continuous, one after the other?
posted by jquinby at 12:34 PM on March 20, 2012


gah, never mind
posted by jquinby at 12:36 PM on March 20, 2012


Response by poster: No, each record is three lines and the next record has no separator other than the carriage return.
posted by v-tach at 12:38 PM on March 20, 2012


are the fields fixed length? I'm not very efficient, but I'd dump the file into an array and parse every three lines using a for loop and substr.
posted by punchee at 12:53 PM on March 20, 2012


Response by poster: The date fields and transaction number are fixed, the rest are variable.
What I need for tax purposes is one of the two dates, the payee, and the amount, I could discard the rest.
posted by v-tach at 12:59 PM on March 20, 2012


Best answer: You can do some hacky perl.

Here's the perl:

#!/usr/bin/perl

my %times;

my $output = "";
while ( <> ){
my $entry = $_;
my $date1;
my $date2;
my $payee;
my $transactionNum;
my $comment;
my $amount;


if ($entry =~ m/([A-Z]{3}\s+\d{2})\s+([A-Z]{3}\s+\d{2})(.*)/){
$date1 = $1;
$date2 = $2;
$payee = $3;
$output .= "$date1, $date2, $payee,";

}elsif ($entry =~ m/(\d{10,})\s+(.*?)\n/){
$transactionNum = $1;
$comment = $2;

$output .= "$transactionNum, $comment";


}elsif ($entry =~ m/(\$\d+.*)/){
$amount = $1;

$output .= "$amount\n";

}
}

print "$output\n";

-------end of perl----
usage is like

cat | perl

I just tested, it seems to work:
cat testinput| / perl test-script

DEC 19, DEC 21, RADIO PARADISE 530¯872¯4993 CA,85450930354980007961385, Foreign Currency¯USD 20.00 Exchange rate¯1.046500$20.93
DEC 23, DEC 24, SUNTERRA LENDRUM MA CALGARY AB,$101.45

posted by lyra4 at 1:03 PM on March 20, 2012 [3 favorites]


Note that this is super hacky and I put each matched field into its own variable so that you can choose which ones you append into the output string.

Also... is there actually any way to correctly format code on mefi? :(
posted by lyra4 at 1:04 PM on March 20, 2012


Best answer: This should get you started in awk:
#!/usr/bin/awk -f
{
  mon1=$1;
  day1=$2;
  mon2=$3;
  day2=$4;
  payee=$5 " " $6 " " $7 " " $8 " " $9;
  i=getline;
  if (i <= 0) {
    print "* error reading file"
    exit;
  }
  transact=$0;
  i=getline;
  if (i <= 0) {
    print "* error reading file"
    exit;
  }
  amount=$1;
  print mon1 " " day1 "," mon2 " " day2 "," payee "," transact "," amount;
}
Make executable, then you just: cat foo.txt | ./yourscript.awk
posted by introp at 1:08 PM on March 20, 2012 [5 favorites]


Lyra: use the <pre> tag for preformatted blocks of text. Use <code> for one-liners (like my command line given at the end of my post).
posted by introp at 1:09 PM on March 20, 2012 [1 favorite]


introp: thank you! :o)

I've never seen awk match across multiple lines in quite that way. neat.
posted by lyra4 at 1:10 PM on March 20, 2012


Response by poster: The awk script is giving me an error


V-tach:RBC Statements emergencydoc$ cat RBC\ Dec\ 13-Jan112010 | ./awkscriptcsv
JAN 11 JAN 11 PURCHASE INTEREST 19.99% ,¯2775 ONAONExchange rate¯1.046500
* error reading file
posted by v-tach at 1:18 PM on March 20, 2012


Well, either your file doesn't exactly match the format given (there are blank lines in the interior, trailing blanks or partial records, etc.) or I don't at all understand how MacOS handles piping. The line of output showing there ("JAN 11 JAN 11" etc.) doesn't make sense at all.

Is the input file delimited by tabs, by any chance, and not spaces?
posted by introp at 1:23 PM on March 20, 2012


Response by poster: as does the perl script


V-tach:RBC Statements emergencydoc$ cat RBC\ Dec\ 13-Jan112010 | ./perlscriptcsv
JAN 11 JAN 11 PURCHASE INTEREST 19.99% ,$49.45ONAONExchange rate¯1.046500

they both are failing on two records, one with a long comment

DEC 19 DEC 21 RADIO PARADISE 530¯872¯4993 CA
85450930354980007961385 Foreign Currency¯USD 20.00 Exchange rate¯1.046500
$20.93

and the other an interest charge that has no transaction number

JAN 11 JAN 11 PURCHASE INTEREST 19.99%

$49.45
posted by v-tach at 1:27 PM on March 20, 2012


Response by poster: The input file has only white space and returns.
posted by v-tach at 1:29 PM on March 20, 2012


Response by poster: Here is one month's records that I'm using as a test file.


banktest.txt
posted by v-tach at 1:40 PM on March 20, 2012


Response by poster: oops

http://dl.dropbox.com/u/259834/banktest.txt
posted by v-tach at 1:41 PM on March 20, 2012


Just a side note... Are there any actual card numbers in this post or the file? You probably don't want to put things like that into the public internet without masking them.
posted by philipy at 1:45 PM on March 20, 2012


Response by poster: No, and no embarrassing purchases either!
posted by v-tach at 1:48 PM on March 20, 2012


What about copying the tabular data from the PDF into a spreadsheet? Opening a PDF in Acrobat will let you "Copy as Table" and paste that into whatever spreadsheet you prefer. Performing transformations from there on the spreadsheet should be cake after that.
posted by sub-culture at 1:49 PM on March 20, 2012


Well, I can process your data file with no trouble *if* I convert your Mac line endings into my platform's format:

$ cat banktest.txt | mac2unix | ./records.awk
DEC 13,DEC 14,AMAZON.CA AMAZON.CA ON ,55490530347000817023251,$147.97
DEC 13,DEC 15,CRESTWOOD APOTHECARY(P EDMONTON AB ,75259110347920789651500,$147.82
DEC 14,DEC 15,ITUNES 800¯676¯2775 ON ,55490530348000069666079,$14.99
DEC 15,DEC 17,ORIGINAL JOE’S EDMONTON AB ,55134420350800116193147,$128.93
DEC 15,DEC 16,MOUNTAIN EQUIP CO¯OP EDMONTON AB,55134420349800172792422,$32.55
DEC 16,DEC 17,SOUNDSPECTRUM INC 02123444400 NY ,55460290350200229000282 Foreign Currency¯USD 19.95 Exchange rate¯1.032581,$20.60
DEC 16,DEC 17,THE WILDBIRD GENERA EDMONTON AB,55181360350463608305461,$76.59
... etc. etc. etc.


If I don't include the mac2unix, I get trashed lines like you do. I would've *thought* all Mac utilities would understand native Mac line endings ('\r'), but apparently not?
posted by introp at 1:50 PM on March 20, 2012


That's funny you noted that, introp. I've been trying to do it with sed, and it doesn't like the Mac line endings, either, so I did: perl -p -e 's/\r/\n/g' < banktest.txt > banktestfixed.txt, and then sed behaved properly.
posted by retypepassword at 1:54 PM on March 20, 2012


Hm. I suppose I could've used sed to convert the format, too.
posted by retypepassword at 1:56 PM on March 20, 2012


Response by poster: Tried the acrobat paste-still not formatted.

I saved the file as text with Smultron, perhaps it didn't save with mac line endings?
posted by v-tach at 1:56 PM on March 20, 2012


Best answer: Your banktest.txt *does* have Mac line endings. It looks like your version of awk doesn't appreciate that, for some reason. (Or, at least, I get the same strange-looking parse errors as you do without running mac2unix on the stream.)

If you don't have mac2unix on your machine, try
cat banktest.txt | sed -e 's/\r/\n/g' | ./records.awk

posted by introp at 1:59 PM on March 20, 2012


Response by poster: Just checked Smultron preferences and I had it set to leave line endings alone, changed it to Unix line endings and will try again.
posted by v-tach at 2:00 PM on March 20, 2012


Response by poster: Sucess!!! with both scripts!!! Thanks to introp and lyra4!!!
posted by v-tach at 2:06 PM on March 20, 2012


« Older Help me find this umbrella?   |   What was this scifi comic from the 80s? Newer »
This thread is closed to new comments.