Simple data extraction program from TXT to CSV?
January 14, 2012 9:27 AM

I have several hundred machine-generated, human-readable TXT files; one text file per serialized device tested. Each file has test values in predictable places, although in a very irregularly-shaped "table" if you could even call it that. I want a simple Windows program where I say: "Okay, program, here's the row and columns where field #1 starts and ends, ... here's the row and columns where field #n starts and ends ... Give me a CSV file with those values I need and nothing more."

Is there such a program, freeware or otherwise? Seems like there should be, but my Google-foo fails me. Excel and Calc, as near as I can tell, aren't what I need. Prefer something with a GUI. Prefer not to have to "program," apart from having to enter start/end locations for the values I want. Thanks hive-mind!
posted by ZenMasterThis to Computers & Internet (18 answers total) 1 user marked this as a favorite
If I understand the problem correctly, I think you could import the file into Google Refine with each row in a single column and tell Refine to splitByLengths() or smartSplit() the row into columns.

Or you could just get some Windows implementation (GNUWin32?) of Unix utilities such as cut or awk (a.k.a. gawk) to do the job at the command line.

But if you get as far as having to use awk, you might as well use Perl, Python, Ruby, etc. I understand you're reluctant to program, but this is a very easy task. One simplification I'd recommend is not outputting to CSV as an intermediate step but rather TSV, because Refine, Excel, etc. will read that in fine.
posted by Monsieur Caution at 10:04 AM on January 14, 2012


I don't know if you consider building regular expressions to be "programming", but many text editors have regexp-enabled search and replace; I suspect if you construct a good regexp (which I suspect mefites can help with) you can get back what you want.
posted by Philosopher Dirtbike at 10:31 AM on January 14, 2012


Philosopher Dirtbike: We might be getting tino the right neighborhood.

1) Is there a regular expression for (eg.) "Delete everything on line 5 except text from columns 10 to 15 and columns 20 to 25; put a comma (or tab) between the two saved parts"?

2) Can I assemble a list of such regular expressions into a script which editpadpro (eg.) can run?

Thanks!
posted by ZenMasterThis at 10:40 AM on January 14, 2012


OK looks like the answer to the second question might be "macros."
posted by ZenMasterThis at 10:42 AM on January 14, 2012


Looks like the answer to the first question is probably yes.
posted by ZenMasterThis at 10:45 AM on January 14, 2012


Are you on Windows? Textpad has macros.

If you can change the first few lines manually to become a well-formatted CSV, then the macro could apply it to the whole document.
posted by User7 at 10:46 AM on January 14, 2012


When you say "column", are you referring to the character number on the line? I'm not an expert on regular expressions, but I believe quantifiers are what you want. Just match all characters from 1 to 9 with a quantified wildcard match, then return the next 5 characters in the same way (using parentheses, I believe, depending on how the editor uses regexps), etc, etc.
posted by Philosopher Dirtbike at 10:48 AM on January 14, 2012


I mean the character count from the start of the line, eg. the first character (including spaces, tabs, etc.) on the line is 1, the second 2, etc.
posted by ZenMasterThis at 10:53 AM on January 14, 2012


You're going off into crazy land and turning a 30 second problem into a drawn out nightmare.

Bite the bullet and download a scripting language, I would recommend Strawberry Perl, but others would prefer Python. Then you're done.
for my $filename (@ARGV) {
  open my $fh, '<', $filename or die "error: $!";
  my $line_number = 0;
  my %fields = ( filename => $filename );
  while (my $line = <$fh>) {
    $line_number++;
    if ($line_number == 1) {
      $field{field1} = substr $line, 10, 5;  # start character 10 (0 based) take 5
    }
    if ($line_number == 2 {
        $field{field2} = substr $line, 30, 5;
    }
  }
  open my $output, '>>', 'output.csv' or die "error: $!";
  print $output "$field{filename},$field{field1},$field{field2}$/";
}
A little work and all 100's of your files are in a CSV in the blink of an eye vs hours and hours of trying to figure out regular expressions and macros for some specific editor and doing each file.

You can probably simplify things greatly depending on your input file. Are the fields really from columnX to columnY (fixed width fields) or are they whitespace separated?

The reason you don't find a GUI program that does this via a simple Google search is that nobody would bother. It's a one-off bit of code that takes a minute to write and test, probably under a second to run over your 100s of files and would take 30 more minutes to put a generalized GUI on it.
posted by zengargoyle at 11:11 AM on January 14, 2012


are the values that you want delimited in any way (e.g. tab, comma, always preceded by a certain number of characters)? that makes it a lot easier, obviously.

I had a similar situation (poorly designed program spitting out broken tab delimited files) at work and wrote a short Ruby program in maybe a week, which ran on all the data files collected to that point to massage them so the stuff I wanted could be injected into an excel spreadsheet. This was much faster and more fun than hand-editing each file before opening it.
posted by ArgentCorvid at 11:19 AM on January 14, 2012


I should note that the week i mentioned it took me to write the script included doing other duties at work, and researching the OLE automation interface of excel so that it would also make graphs for me from the data.
posted by ArgentCorvid at 11:23 AM on January 14, 2012


Oh, man ... this is so obvious I can't believe I didn't think of this years ago ... all I need to do is create an Excel file with a bunch of Column B cells with text functions (eg. =MID(A3,23,5)) and import the TXT file into Column A. Then I can pull the Column B values into a second tab sheet in the format I want, and export that to CSV!

Thanks for the other suggestions ... I may yet need to use them!
posted by ZenMasterThis at 11:30 AM on January 14, 2012


Can't you make this even easier and define the file as having fixed width columns on import into Excel, then just delete the columns you don't need?
posted by Philosopher Dirtbike at 1:02 PM on January 14, 2012


No, because the position of the fields varies from line to line.
posted by ZenMasterThis at 1:57 PM on January 14, 2012


awk/sed/python/perl are usually the best answers for this.

One thing to check is if the values in your table are actually in different places or if the table is tab-separated, with nothing in the empty fields, and you're just seeing them as being in different places in notepad/wordpad.
posted by beerbajay at 3:11 PM on January 14, 2012


No, they're really in different places, but I think I've finally come up with a pretty good system for getting it all done in Excel using the MID() function.

Thanks!
posted by ZenMasterThis at 3:48 PM on January 14, 2012


Data Wrangler
posted by cosmologinaut at 11:42 PM on January 14, 2012


cosmologinaut: YES! THANKS!
posted by ZenMasterThis at 9:50 AM on January 15, 2012


« Older Need DOS Emulator   |   Race restrictions for apheresis? Newer »
This thread is closed to new comments.