How Do I Extract Multiple Data and Put Them into CSV Format?
October 19, 2010 8:35 PM Subscribe
Extract Multiple Data (email, names, dates, and IDs) from a Huge Messy Text File, and Sort Them into a CSV: How do I do this? I've read this on extracting emails, but I can't figure out how to do this with several different data points.
To flesh things out a bit, here's an example of the text I'm starting with:
______
blah blah blah blah blah blahblah blah blah blah blah blah blah blah blahblah blah blah
blah blah blah blah blah blahblah blah blah
Date: 2007-03-26
Order: 09093SFB
Price: $1.01
Item: blahsomenumberblah
Title: some_item_title_text_string
Name1: JANE
Name2: SMITH
Ctry: US
State: MA
Postl: 10101
Email: blahblahblah@yahoo.com
blah blah blah blah
blah blah
blah blah blah blah blah blahblah blah blah
________________
I'm trying to sort about 2000 pages of data presently in the format above, putting the important stuff into these fields, within csv format:
EMAIL | FIRSTNAME | LASTNAME | TITLE | DATE
Everything but that which fits into the columns above can be discarded.
I've been toying with WinGrep and Notepad++, but to little avail. Any suggestions you could offer would be great. Thanks!
To flesh things out a bit, here's an example of the text I'm starting with:
______
blah blah blah blah blah blahblah blah blah blah blah blah blah blah blahblah blah blah
blah blah blah blah blah blahblah blah blah
Date: 2007-03-26
Order: 09093SFB
Price: $1.01
Item: blahsomenumberblah
Title: some_item_title_text_string
Name1: JANE
Name2: SMITH
Ctry: US
State: MA
Postl: 10101
Email: blahblahblah@yahoo.com
blah blah blah blah
blah blah
blah blah blah blah blah blahblah blah blah
________________
I'm trying to sort about 2000 pages of data presently in the format above, putting the important stuff into these fields, within csv format:
EMAIL | FIRSTNAME | LASTNAME | TITLE | DATE
Everything but that which fits into the columns above can be discarded.
I've been toying with WinGrep and Notepad++, but to little avail. Any suggestions you could offer would be great. Thanks!
The non-programming option is a series of clever find/replaces using wildcards in whatever text editor you have that won't choke on 2000 pages.
Less "extract the data" and more "chipping away at everything that isn't the data."
That's how I roll with these things.
posted by pantarei70 at 9:32 PM on October 19, 2010 [2 favorites]
Less "extract the data" and more "chipping away at everything that isn't the data."
That's how I roll with these things.
posted by pantarei70 at 9:32 PM on October 19, 2010 [2 favorites]
This might do it. Put it in Excel. Number the rows. Sort with two factors, the first is your column with the labels, the second is row number. Your data will now be listed with all of the Name1 values in order, and your Name2 values will be in the same order (so if you line up the data, name1 and name2 should match).
You can copy and paste those sorted values into distinct columns in the order you want.
posted by Gorgik at 9:33 PM on October 19, 2010 [1 favorite]
You can copy and paste those sorted values into distinct columns in the order you want.
posted by Gorgik at 9:33 PM on October 19, 2010 [1 favorite]
Response by poster: > Chuck that in the VB editor in excel
Thanks all-- I should note that the file is in .txt format; it's completely unstructured, with not even a comma between the data.
Also, I'm using OpenOffice Calc rather than Excel.
Do these factors make a difference?
posted by darth_tedious at 9:55 PM on October 19, 2010
Thanks all-- I should note that the file is in .txt format; it's completely unstructured, with not even a comma between the data.
Also, I'm using OpenOffice Calc rather than Excel.
Do these factors make a difference?
posted by darth_tedious at 9:55 PM on October 19, 2010
Best answer: Pretty easy to do reliably with a fairly simple perl script. Grab a copy of Strawberry Perl for Windows. Install it if you can, otherwise you'll have to use the portable edition.
Next up, pop this perl script somewhere:
There's a chance that the library Text::CSV_XS doesn't come with strawberry perl, in which case you'll need to install the library with the command:
Finally, if you do this kind of thing even slightly often, Learning Perl (links to books, one good and not free, one OK and free) could be very valuable to you, and could contribute to your career success.
Final issue. If your file is gigabytes in size, then you might run out of system memory as this script reads the entire file into memory before going through it. In which case what I'd do for a one-off job is split the file into smaller chunks and run through them one at a time (making sure I didn't overwrite the 'data.csv' file each run through).
p.s. the script took me about the same amount of time to write as the answer, and I tested it with your dummy data. It uses a regular expression to grab the rows that you're interested in.
posted by singingfish at 9:57 PM on October 19, 2010 [4 favorites]
Next up, pop this perl script somewhere:
use warnings; use strict; my $file = $ARGV[0] or die "provide file name on the command line\n"; open my $IN, "<> my @head = qw/Date Name1 Name2 Email Title/; my $data_fields = join "|", @head; use Text::CSV_XS; my $csv = Text::CSV_XS->new(); open my $OUT, ">", "data.csv"; $csv->combine(@head); print $OUT $csv->string, "\n"; my %record; my @results; my $break = 0; while (<>) { chomp; $break=1 if $_ eq 'break here'; $DB::single=1 if $break; my ($field, $info) = $_ =~ /($data_fields):\s+(.*+)$/; if ($field) { if (! $record{$field}) { $record{$field} = $info; } else { my @row = @record{@head}; $csv->combine(@row); print $OUT $csv->string,"\n"; undef %record; $record{$field} = $info; } } } # flush out the last row my @row = @record{@head}; $csv->combine(@row); print $OUT $csv->string,"\n"; >>have your data file in the same directory as the script, and run the script from cmd.exe:
perl my_script_name.pl my_data_file.txtThe output you need should be in data.txt
There's a chance that the library Text::CSV_XS doesn't come with strawberry perl, in which case you'll need to install the library with the command:
cpan Text::CSV_XSagain from the command line. (perl and cpan have to be in your PATH - automatically taken care of if you use the installer, a little more complicated if you need to use the portable edition).
Finally, if you do this kind of thing even slightly often, Learning Perl (links to books, one good and not free, one OK and free) could be very valuable to you, and could contribute to your career success.
Final issue. If your file is gigabytes in size, then you might run out of system memory as this script reads the entire file into memory before going through it. In which case what I'd do for a one-off job is split the file into smaller chunks and run through them one at a time (making sure I didn't overwrite the 'data.csv' file each run through).
p.s. the script took me about the same amount of time to write as the answer, and I tested it with your dummy data. It uses a regular expression to grab the rows that you're interested in.
posted by singingfish at 9:57 PM on October 19, 2010 [4 favorites]
argh, to be more robust, this line:
posted by singingfish at 10:01 PM on October 19, 2010
my ($field, $info) = $_ =~ /($data_fields):\s+(.*+)$/;should probably be changed to this:
my ($field, $info) = $_ =~ /^($data_fields):\s+(.*+)$/;to anchor the match for the data field name only if it's at the beginning of the line.
posted by singingfish at 10:01 PM on October 19, 2010
Response by poster: > Pretty easy to do reliably with a fairly simple perl script. Grab a copy of Strawberry Perl for Windows.
Thanks!
posted by darth_tedious at 10:16 PM on October 19, 2010
Thanks!
posted by darth_tedious at 10:16 PM on October 19, 2010
Best answer: Here you go, in Python and it uses streams so it'll be fast and easily deal with the quantity of data you've got.
Usage:
cat mefi-data.txt | ./mefi-munger.py > result.csv
#!/usr/bin/python2.6
import sys
fields = [["Email","EMAIL"],["Name1","FIRSTNAME"],["Name2","LASTNAME"],["Title","TITLE"],["Date","DATE"]]
row_template = {}
for field in fields:
print field[1],",",
row_template[field[0]] = ""
print ""
row = {}
while True:
line = sys.stdin.readline()
if not line: break
elif line.find(":") == -1: continue
field = line[0:line.find(":")]
if field in row_template.keys():
row[field]=line[line.find(":")+1:].strip()
if len(row) == len(fields):
for field in fields:
print row[field[0]],",",
print ""
row = {}
posted by holloway at 10:25 PM on October 19, 2010 [1 favorite]
Usage:
cat mefi-data.txt | ./mefi-munger.py > result.csv
#!/usr/bin/python2.6
import sys
fields = [["Email","EMAIL"],["Name1","FIRSTNAME"],["Name2","LASTNAME"],["Title","TITLE"],["Date","DATE"]]
row_template = {}
for field in fields:
print field[1],",",
row_template[field[0]] = ""
print ""
row = {}
while True:
line = sys.stdin.readline()
if not line: break
elif line.find(":") == -1: continue
field = line[0:line.find(":")]
if field in row_template.keys():
row[field]=line[line.find(":")+1:].strip()
if len(row) == len(fields):
for field in fields:
print row[field[0]],",",
print ""
row = {}
posted by holloway at 10:25 PM on October 19, 2010 [1 favorite]
Best answer: Just for hack's sake, the cheap-o Perl way to do it, which would work if the fields we want are always all present in that order and never contain double quotes, is:
posted by nicwolff at 10:44 PM on October 19, 2010 [1 favorite]
perl -lne '@F = split /: /, $_, 2; $data{$F[0]} = $F[1]; if ( $F[0] eq "Email" ) { print join ", ", map qq("$_"), @data{ qw(Email Name1 Name2 Title Date) } }'
posted by nicwolff at 10:44 PM on October 19, 2010 [1 favorite]
oops, the line that says:
And I was wrong it doesn't read the whole file into memory, it uses something similar to streams in python. So the quantity of data isn't an issue.
posted by singingfish at 11:41 PM on October 19, 2010
while (<>) { >should say:
while (<$IN>) {(stupid html formatting error :] )
And I was wrong it doesn't read the whole file into memory, it uses something similar to streams in python. So the quantity of data isn't an issue.
posted by singingfish at 11:41 PM on October 19, 2010
Oh yes, and the reason I used the CPAN module rather than hand-rolling a CSV solution was that CSV edge cases can be notoriously squiggly, and using someone elses' well tested library saves you the pain of making any mistakes :)
posted by singingfish at 11:47 PM on October 19, 2010 [1 favorite]
posted by singingfish at 11:47 PM on October 19, 2010 [1 favorite]
Response by poster: > Here you go, in Python
Thanks!
In running this, I'm just getting
Email Name1 Name2 Title Date at the command prompt
rather than seeing a result.csv.
Is this likely a matter of my getting the path to Python wrong, or of not configuring properly?
posted by darth_tedious at 11:55 PM on October 19, 2010
Thanks!
In running this, I'm just getting
Email Name1 Name2 Title Date at the command prompt
rather than seeing a result.csv.
Is this likely a matter of my getting the path to Python wrong, or of not configuring properly?
posted by darth_tedious at 11:55 PM on October 19, 2010
and it still has the stuff I needed for debugging so these two lines are totaly useless outside the debugger (and the test data I used) as well:
posted by singingfish at 11:56 PM on October 19, 2010
$break=1 if $_ eq 'break here'; $DB::single=1 if $break;
posted by singingfish at 11:56 PM on October 19, 2010
Response by poster: > and it still has the stuff
Thanks for the addendum; I'll run Strawberry Perl also...
posted by darth_tedious at 12:27 AM on October 20, 2010
Thanks for the addendum; I'll run Strawberry Perl also...
posted by darth_tedious at 12:27 AM on October 20, 2010
Is this likely a matter of my getting the path to Python wrong, or of not configuring properly?Even with an empty data file It should print the CSV header which is a single line containing "EMAIL , FIRSTNAME , LASTNAME , TITLE , DATE ," rather than the output you've seen, so you've probably got your Python misconfigured somehow. Make sure the shebang at the top of the file points to your version of Python. Once this is configured I've tried the script on Windows and Linux and it works.
On Linux if you save that script as mefi-munger.py and then run chmod +x ./mefi-munger.py on it you'll be able to stream data to it with the cat command. e.g. if your data file is called mefi-data.txt then just run,
cat mefi-data.txt | ./mefi-munger.py
And you should see the generated CSV file whiz by. You can then redirect this output to a file by running,
cat mefi-data.txt | ./mefi-munger.py > result.csv
But leave out the > result.csv bit until you've got it working. Good luck!
posted by holloway at 12:40 AM on October 20, 2010 [1 favorite]
I'm guessing that windows doesn't support the usual Unix conventions of pipes properly, assuming that you're on windows. The perl script I posted should work on any system that can compile perl.
posted by singingfish at 2:09 AM on October 20, 2010
posted by singingfish at 2:09 AM on October 20, 2010
Windows supports pipes, redirection, and even stdout and stderr.
posted by holloway at 2:28 AM on October 20, 2010
posted by holloway at 2:28 AM on October 20, 2010
On Windows the command called type is apparently similar to cat so that would make the command,
type mefi-data.txt | ./mefi-munger.py > result.csv
posted by holloway at 2:48 AM on October 20, 2010
type mefi-data.txt | ./mefi-munger.py > result.csv
posted by holloway at 2:48 AM on October 20, 2010
This is what awk is for. Yeah, Perl and Python might be faster to run or cooler, but Awk has this super simple Pattern-Action syntax that is faster and more compact to write, and you have more of a chance of understanding what's going on and changing the program accordingly.
posted by scruss at 4:50 AM on October 20, 2010
posted by scruss at 4:50 AM on October 20, 2010
This thread is closed to new comments.
Sub AskMe()
Dim sOut As String
Open "C:\testdata.txt" For Input As 1
Open "C:\output.csv" For Output As 2
While Not EOF(1)
Line Input #1, strThisLine
strThisLine = Trim(strThisLine)
If Left(strThisLine, 4) = "Date" Then sDate = Right(strThisLine, Len(strThisLine) - 6)
If Left(strThisLine, 5) = "Title" Then sTitle = Right(strThisLine, Len(strThisLine) - 7)
If Left(strThisLine, 5) = "Name1" Then sName1 = Right(strThisLine, Len(strThisLine) - 7)
If Left(strThisLine, 5) = "Name2" Then sName2 = Right(strThisLine, Len(strThisLine) - 7)
If Left(strThisLine, 5) = "Email" Then
sEmail = Right(strThisLine, Len(strThisLine) - 7)
Print #2, sEmail & ", " & sName1 & ", " & sName2 & ", " & sTitle & ", " & sDate
End If
Wend
Close 2
Close 1
End Sub
Chuck that in the VB editor in excel, fix the input and output filenames, and run.
posted by pompomtom at 9:09 PM on October 19, 2010 [2 favorites]