Convert text with whitespace to csv
March 20, 2012 12:32 PM Subscribe
perl or sed or awk help for newbie
I am trying to get my tax preparation done and so I need to get info into Moneydance. One of my credit card accounts does not give ofx downloads, just pdfs of the monthly account summaries, and I have managed to extract the text from the pdfs and I now need to transform the text into csv format to import it. I have roughly a thousand transactions to process. I have MacOS 10.7 and can use the terminal but no regex skills.
The text is currently formatted as
MMM DD MMM DD PAYEE INFORMATION CITY PROVINCE
TRANSACTIONNUMBER COMMENT
$AMOUNT
and here is an example of two records
DEC 19 DEC 21 RADIO PARADISE 530¯872¯4993 CA
85450930354980007961385 Foreign Currency¯USD 20.00 Exchange rate¯1.046500
$20.93
DEC 23 DEC 24 SUNTERRA LENDRUM MA CALGARY AB
55181360357461606795546
$101.45
How can I transform this to a csv file such as
MMM DD, MMM DD, PAYEE INFORMATION CITY PROVINCE, TRANSACTIONNUMBER COMMENT, AMOUNT
which I can import into my financial software?
I am trying to get my tax preparation done and so I need to get info into Moneydance. One of my credit card accounts does not give ofx downloads, just pdfs of the monthly account summaries, and I have managed to extract the text from the pdfs and I now need to transform the text into csv format to import it. I have roughly a thousand transactions to process. I have MacOS 10.7 and can use the terminal but no regex skills.
The text is currently formatted as
MMM DD MMM DD PAYEE INFORMATION CITY PROVINCE
TRANSACTIONNUMBER COMMENT
$AMOUNT
and here is an example of two records
DEC 19 DEC 21 RADIO PARADISE 530¯872¯4993 CA
85450930354980007961385 Foreign Currency¯USD 20.00 Exchange rate¯1.046500
$20.93
DEC 23 DEC 24 SUNTERRA LENDRUM MA CALGARY AB
55181360357461606795546
$101.45
How can I transform this to a csv file such as
MMM DD, MMM DD, PAYEE INFORMATION CITY PROVINCE, TRANSACTIONNUMBER COMMENT, AMOUNT
which I can import into my financial software?
Response by poster: No, each record is three lines and the next record has no separator other than the carriage return.
posted by v-tach at 12:38 PM on March 20, 2012
posted by v-tach at 12:38 PM on March 20, 2012
are the fields fixed length? I'm not very efficient, but I'd dump the file into an array and parse every three lines using a for loop and substr.
posted by punchee at 12:53 PM on March 20, 2012
posted by punchee at 12:53 PM on March 20, 2012
Response by poster: The date fields and transaction number are fixed, the rest are variable.
What I need for tax purposes is one of the two dates, the payee, and the amount, I could discard the rest.
posted by v-tach at 12:59 PM on March 20, 2012
What I need for tax purposes is one of the two dates, the payee, and the amount, I could discard the rest.
posted by v-tach at 12:59 PM on March 20, 2012
Best answer: You can do some hacky perl.
Here's the perl:
#!/usr/bin/perl
my %times;
my $output = "";
while ( <> ){
my $entry = $_;
my $date1;
my $date2;
my $payee;
my $transactionNum;
my $comment;
my $amount;
if ($entry =~ m/([A-Z]{3}\s+\d{2})\s+([A-Z]{3}\s+\d{2})(.*)/){
$date1 = $1;
$date2 = $2;
$payee = $3;
$output .= "$date1, $date2, $payee,";
}elsif ($entry =~ m/(\d{10,})\s+(.*?)\n/){
$transactionNum = $1;
$comment = $2;
$output .= "$transactionNum, $comment";
}elsif ($entry =~ m/(\$\d+.*)/){
$amount = $1;
$output .= "$amount\n";
}
}
print "$output\n";
-------end of perl----
usage is like
cat | perl
I just tested, it seems to work:
cat testinput| / perl test-script
DEC 19, DEC 21, RADIO PARADISE 530¯872¯4993 CA,85450930354980007961385, Foreign Currency¯USD 20.00 Exchange rate¯1.046500$20.93
DEC 23, DEC 24, SUNTERRA LENDRUM MA CALGARY AB,$101.45 >
posted by lyra4 at 1:03 PM on March 20, 2012 [3 favorites]
Here's the perl:
#!/usr/bin/perl
my %times;
my $output = "";
while ( <> ){
my $entry = $_;
my $date1;
my $date2;
my $payee;
my $transactionNum;
my $comment;
my $amount;
if ($entry =~ m/([A-Z]{3}\s+\d{2})\s+([A-Z]{3}\s+\d{2})(.*)/){
$date1 = $1;
$date2 = $2;
$payee = $3;
$output .= "$date1, $date2, $payee,";
}elsif ($entry =~ m/(\d{10,})\s+(.*?)\n/){
$transactionNum = $1;
$comment = $2;
$output .= "$transactionNum, $comment";
}elsif ($entry =~ m/(\$\d+.*)/){
$amount = $1;
$output .= "$amount\n";
}
}
print "$output\n";
-------end of perl----
usage is like
cat
I just tested, it seems to work:
cat testinput| / perl test-script
DEC 19, DEC 21, RADIO PARADISE 530¯872¯4993 CA,85450930354980007961385, Foreign Currency¯USD 20.00 Exchange rate¯1.046500$20.93
DEC 23, DEC 24, SUNTERRA LENDRUM MA CALGARY AB,$101.45
posted by lyra4 at 1:03 PM on March 20, 2012 [3 favorites]
Note that this is super hacky and I put each matched field into its own variable so that you can choose which ones you append into the output string.
Also... is there actually any way to correctly format code on mefi? :(
posted by lyra4 at 1:04 PM on March 20, 2012
Also... is there actually any way to correctly format code on mefi? :(
posted by lyra4 at 1:04 PM on March 20, 2012
Best answer: This should get you started in awk:
posted by introp at 1:08 PM on March 20, 2012 [5 favorites]
#!/usr/bin/awk -f { mon1=$1; day1=$2; mon2=$3; day2=$4; payee=$5 " " $6 " " $7 " " $8 " " $9; i=getline; if (i <= 0) { print "* error reading file" exit; } transact=$0; i=getline; if (i <= 0) { print "* error reading file" exit; } amount=$1; print mon1 " " day1 "," mon2 " " day2 "," payee "," transact "," amount; }Make executable, then you just:
cat foo.txt | ./yourscript.awk
posted by introp at 1:08 PM on March 20, 2012 [5 favorites]
Lyra: use the <pre> tag for preformatted blocks of text. Use <code> for one-liners (like my command line given at the end of my post).
posted by introp at 1:09 PM on March 20, 2012 [1 favorite]
posted by introp at 1:09 PM on March 20, 2012 [1 favorite]
introp: thank you! :o)
I've never seen awk match across multiple lines in quite that way. neat.
posted by lyra4 at 1:10 PM on March 20, 2012
I've never seen awk match across multiple lines in quite that way. neat.
posted by lyra4 at 1:10 PM on March 20, 2012
Response by poster: The awk script is giving me an error
V-tach:RBC Statements emergencydoc$ cat RBC\ Dec\ 13-Jan112010 | ./awkscriptcsv
JAN 11 JAN 11 PURCHASE INTEREST 19.99% ,¯2775 ONAONExchange rate¯1.046500
* error reading file
posted by v-tach at 1:18 PM on March 20, 2012
V-tach:RBC Statements emergencydoc$ cat RBC\ Dec\ 13-Jan112010 | ./awkscriptcsv
JAN 11 JAN 11 PURCHASE INTEREST 19.99% ,¯2775 ONAONExchange rate¯1.046500
* error reading file
posted by v-tach at 1:18 PM on March 20, 2012
Well, either your file doesn't exactly match the format given (there are blank lines in the interior, trailing blanks or partial records, etc.) or I don't at all understand how MacOS handles piping. The line of output showing there ("JAN 11 JAN 11" etc.) doesn't make sense at all.
Is the input file delimited by tabs, by any chance, and not spaces?
posted by introp at 1:23 PM on March 20, 2012
Is the input file delimited by tabs, by any chance, and not spaces?
posted by introp at 1:23 PM on March 20, 2012
Response by poster: as does the perl script
V-tach:RBC Statements emergencydoc$ cat RBC\ Dec\ 13-Jan112010 | ./perlscriptcsv
JAN 11 JAN 11 PURCHASE INTEREST 19.99% ,$49.45ONAONExchange rate¯1.046500
they both are failing on two records, one with a long comment
DEC 19 DEC 21 RADIO PARADISE 530¯872¯4993 CA
85450930354980007961385 Foreign Currency¯USD 20.00 Exchange rate¯1.046500
$20.93
and the other an interest charge that has no transaction number
JAN 11 JAN 11 PURCHASE INTEREST 19.99%
$49.45
posted by v-tach at 1:27 PM on March 20, 2012
V-tach:RBC Statements emergencydoc$ cat RBC\ Dec\ 13-Jan112010 | ./perlscriptcsv
JAN 11 JAN 11 PURCHASE INTEREST 19.99% ,$49.45ONAONExchange rate¯1.046500
they both are failing on two records, one with a long comment
DEC 19 DEC 21 RADIO PARADISE 530¯872¯4993 CA
85450930354980007961385 Foreign Currency¯USD 20.00 Exchange rate¯1.046500
$20.93
and the other an interest charge that has no transaction number
JAN 11 JAN 11 PURCHASE INTEREST 19.99%
$49.45
posted by v-tach at 1:27 PM on March 20, 2012
Response by poster: The input file has only white space and returns.
posted by v-tach at 1:29 PM on March 20, 2012
posted by v-tach at 1:29 PM on March 20, 2012
Response by poster: Here is one month's records that I'm using as a test file.
banktest.txt
posted by v-tach at 1:40 PM on March 20, 2012
banktest.txt
posted by v-tach at 1:40 PM on March 20, 2012
Response by poster: oops
http://dl.dropbox.com/u/259834/banktest.txt
posted by v-tach at 1:41 PM on March 20, 2012
http://dl.dropbox.com/u/259834/banktest.txt
posted by v-tach at 1:41 PM on March 20, 2012
Just a side note... Are there any actual card numbers in this post or the file? You probably don't want to put things like that into the public internet without masking them.
posted by philipy at 1:45 PM on March 20, 2012
posted by philipy at 1:45 PM on March 20, 2012
Response by poster: No, and no embarrassing purchases either!
posted by v-tach at 1:48 PM on March 20, 2012
posted by v-tach at 1:48 PM on March 20, 2012
What about copying the tabular data from the PDF into a spreadsheet? Opening a PDF in Acrobat will let you "Copy as Table" and paste that into whatever spreadsheet you prefer. Performing transformations from there on the spreadsheet should be cake after that.
posted by sub-culture at 1:49 PM on March 20, 2012
posted by sub-culture at 1:49 PM on March 20, 2012
Well, I can process your data file with no trouble *if* I convert your Mac line endings into my platform's format:
If I don't include the mac2unix, I get trashed lines like you do. I would've *thought* all Mac utilities would understand native Mac line endings ('\r'), but apparently not?
posted by introp at 1:50 PM on March 20, 2012
$ cat banktest.txt | mac2unix | ./records.awk
DEC 13,DEC 14,AMAZON.CA AMAZON.CA ON ,55490530347000817023251,$147.97
DEC 13,DEC 15,CRESTWOOD APOTHECARY(P EDMONTON AB ,75259110347920789651500,$147.82
DEC 14,DEC 15,ITUNES 800¯676¯2775 ON ,55490530348000069666079,$14.99
DEC 15,DEC 17,ORIGINAL JOE’S EDMONTON AB ,55134420350800116193147,$128.93
DEC 15,DEC 16,MOUNTAIN EQUIP CO¯OP EDMONTON AB,55134420349800172792422,$32.55
DEC 16,DEC 17,SOUNDSPECTRUM INC 02123444400 NY ,55460290350200229000282 Foreign Currency¯USD 19.95 Exchange rate¯1.032581,$20.60
DEC 16,DEC 17,THE WILDBIRD GENERA EDMONTON AB,55181360350463608305461,$76.59
... etc. etc. etc.
If I don't include the mac2unix, I get trashed lines like you do. I would've *thought* all Mac utilities would understand native Mac line endings ('\r'), but apparently not?
posted by introp at 1:50 PM on March 20, 2012
That's funny you noted that, introp. I've been trying to do it with sed, and it doesn't like the Mac line endings, either, so I did:
posted by retypepassword at 1:54 PM on March 20, 2012
perl -p -e 's/\r/\n/g' < banktest.txt > banktestfixed.txt
, and then sed behaved properly.posted by retypepassword at 1:54 PM on March 20, 2012
Hm. I suppose I could've used sed to convert the format, too.
posted by retypepassword at 1:56 PM on March 20, 2012
posted by retypepassword at 1:56 PM on March 20, 2012
Response by poster: Tried the acrobat paste-still not formatted.
I saved the file as text with Smultron, perhaps it didn't save with mac line endings?
posted by v-tach at 1:56 PM on March 20, 2012
I saved the file as text with Smultron, perhaps it didn't save with mac line endings?
posted by v-tach at 1:56 PM on March 20, 2012
Best answer: Your banktest.txt *does* have Mac line endings. It looks like your version of awk doesn't appreciate that, for some reason. (Or, at least, I get the same strange-looking parse errors as you do without running mac2unix on the stream.)
If you don't have mac2unix on your machine, try
posted by introp at 1:59 PM on March 20, 2012
If you don't have mac2unix on your machine, try
cat banktest.txt | sed -e 's/\r/\n/g' | ./records.awk
posted by introp at 1:59 PM on March 20, 2012
Response by poster: Just checked Smultron preferences and I had it set to leave line endings alone, changed it to Unix line endings and will try again.
posted by v-tach at 2:00 PM on March 20, 2012
posted by v-tach at 2:00 PM on March 20, 2012
Response by poster: Sucess!!! with both scripts!!! Thanks to introp and lyra4!!!
posted by v-tach at 2:06 PM on March 20, 2012
posted by v-tach at 2:06 PM on March 20, 2012
This thread is closed to new comments.
posted by jquinby at 12:34 PM on March 20, 2012