How to parse a text file & collect a piece of each line automatically?
May 9, 2012 10:18 AM

I need to go through text files, each containing hundreds of lines, and copy a specific part of each line, and then add all those parts together into one long line, each part separated by a comma. But how?

Basically, each file has hundreds of lines similar to these examples:

<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef
<a href="index.php?PageName=article_browser&VehicleID=3232&nodeInfo=ghjdsh
<a href="index.php?PageName=article_browser&VehicleID=2323&nodeInfo=dsdfdd
<a href="index.php?PageName=article_browser&VehicleID=5232&nodeInfo=dsdfds

What I need to do is get the numbers AFTER VehicleID= and before the following &amp, and then present them in one long line with commas:
5322,3232,2323,5232 [and so on...]

I have a PC and a Mac at my disposal so either is fine for this. I was going to create an Automator workflow that would find & replace the beginning of each sentence (up to VehicleID=) with a blank/comma, and then find & replace again starting at &amp.... Unfortunately this doesn't work because the rest of the line is always unique.

I haven't done any programming since Turbo Pascal. I looked at Xcode but it would take longer to learn how to do this with Xcode than it would to go through every line in a text processor.

This seems like such a simple thing, though, so maybe someone can think of something I could try.
posted by Unhyper to Computers & Internet (13 answers total) 4 users marked this as a favorite
 
Hmm. What about importing the files into Excel, using the Text to Columns option to split out a column with the needed data, and then saving as CSV? If you have so many files that this is unwieldy, you'll need to write a script.
posted by sarahnicolesays at 10:22 AM on May 9, 2012


Can you dump it in a text file? Without seeing an entire file it's tough for me to visualise/explain.

Do a search for <a href="index.php?PageName=article_browser&VehicleID= and replace with a blank space.

If the vehicle ID is ALWAYS 4 characters long, I'd probably then just paste a comma to manually replace everything to the right of the number.

Alternatively, import into Excel and set the "delimiting" by hand, so everything before the number is one 'column', the vehicle ID is a second 'column', and the other irrelevant stuff is a third column.

Open, nuke first and third columns, resave as csv, then use a find and replace in text to replace the hard return with a comma.
posted by tilde at 10:28 AM on May 9, 2012


This is what Terminal on your Mac is for:

tr '&' '\012' < file.txt | grep VehicleID | sed 's/VehicleID=//' | tr '\012' ',' > newfile.txt

The above adds a trailing comma to the file which you might not need - would that be a problem?
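If the trailing comma does matter, joining the pieces instead of translating newlines avoids it. A rough Python sketch of the same split-and-filter idea, assuming the lines look like the examples in the question (vehicle_ids is just an illustrative name, not part of any tool):

```python
def vehicle_ids(text):
    """Roughly mimic the tr | grep | sed pipeline: split on '&', keep VehicleID fields."""
    ids = []
    for field in text.replace("\n", "&").split("&"):
        if field.startswith("VehicleID="):
            # Drop the "VehicleID=" label and keep what follows it.
            ids.append(field[len("VehicleID="):])
    # join() puts commas only between items, so there is no trailing comma.
    return ",".join(ids)


sample = (
    '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef\n'
    '<a href="index.php?PageName=article_browser&VehicleID=3232&nodeInfo=ghjdsh\n'
)
print(vehicle_ids(sample))  # 5322,3232
```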
posted by scruss at 10:28 AM on May 9, 2012 [1 favorite]


Macs have Unix shell tools, right? Then something like this should do the trick:

sed -e 's,^.*VehicleID=\([0-9]*\)&amp;.*$,\1,' FILENAMES |tr '\n' ','

(Not tested.)
posted by stebulus at 10:29 AM on May 9, 2012


Got perl installed?
perl -i.bak -pe 's/.*VehicleID=(\d+)&.*/$1/;' FILENAME1.html FILENAME2.html ...

This will edit the files in-place and create a bunch of FILENAME.bak copies of the originals which you can safely delete.
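Note that the edited files still hold one ID per line, so the newlines would need to be turned into commas afterwards. For comparison, a minimal Python sketch of the same edit-in-place-with-backup idea using the standard fileinput module (keep_only_ids is a made-up name for illustration):

```python
import fileinput
import re


def keep_only_ids(paths):
    """Rewrite each file in place, leaving the original as <name>.bak."""
    pattern = re.compile(r".*VehicleID=(\d+)&.*")
    with fileinput.input(files=paths, inplace=True, backup=".bak") as lines:
        for line in lines:
            # While inplace=True, print() writes into the file being edited.
            # "." does not match the newline, so line endings are preserved.
            print(pattern.sub(r"\1", line), end="")
```

It would be called with a list of file names, e.g. keep_only_ids(["FILENAME1.html", "FILENAME2.html"]).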
posted by jozxyqk at 10:40 AM on May 9, 2012


What you're looking for are regular expressions, also known as regex. There's a lot of software that offers regex support, and a lot of code libraries that offer it as well if you want something more adaptable.

In your case, you can probably trim the string using the Automator workflow you were describing, and then doing a find-replace for the following pattern should trim off the nodeInfo after the ID value:

&nodeInfo=\w*.*
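To make the two find-and-replace passes concrete, here is a small Python sketch using the re module. It assumes each line looks like the examples in the question; strip_to_id is a hypothetical helper name.

```python
import re


def strip_to_id(line):
    """Two find-and-replace passes: drop the prefix, then the nodeInfo tail."""
    # Pass 1: remove everything up to and including "VehicleID=".
    line = re.sub(r"^.*VehicleID=", "", line)
    # Pass 2: the pattern from this answer removes "&nodeInfo=" and the rest.
    line = re.sub(r"&nodeInfo=\w*.*", "", line)
    return line.strip()


example = '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef'
print(strip_to_id(example))  # 5322
```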
posted by mikurski at 10:41 AM on May 9, 2012


Python:
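import fileinput
import re

# Collect the digits that follow "VehicleID=" on every input line;
# lines without a match (blank lines, other markup) are skipped.
ids = [m.group(1) for line in fileinput.input()
       for m in re.finditer(r"VehicleID=(\d+)", line)]

with open("outfile.txt", "w") as out:
    out.write(",".join(ids))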
Then just invoke the script passing the files you want to handle on the command line:

python extract.py path/to/files/*
posted by kindall at 11:07 AM on May 9, 2012


I'm not a programmer, so this might not be the most elegant method, but I'd do this:

1. Copy the text file into Excel.
2. Run a Text to Columns using "&" as the delimiter. This should give you "VehicleID=5322" in column B.
3. Delete all columns other than B.
4. In another column use =MID(B1,11,4). Use a higher number for that last value if the IDs aren't all 4 digits.
5. Copy that all the way down.
6. Copy those values to a Word doc.
7. Do a find and replace, replacing the paragraph symbol with a comma.

Done.
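For the curious, the spreadsheet steps above map fairly directly onto a few lines of Python. This is only a sketch under the same assumptions as the Excel route (every line contains "&" and the ID is 4 characters wide); id_like_excel is an illustrative name.

```python
def id_like_excel(line, width=4):
    # Step 2: Text to Columns on "&"; column B is the second field.
    column_b = line.split("&")[1]
    # Step 4: =MID(B1,11,4) counts from 1, so it starts at Python index 10.
    return column_b[10:10 + width]


lines = [
    '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef',
    '<a href="index.php?PageName=article_browser&VehicleID=3232&nodeInfo=ghjdsh',
]
# Steps 6-7: replace the line breaks with commas.
print(",".join(id_like_excel(line) for line in lines))  # 5322,3232
```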
posted by etc. at 11:47 AM on May 9, 2012 [2 favorites]


There is a text editor called UltraEdit ( http://www.ultraedit.com/ ). They have a free trial available for PC and Mac.

They have a 'Column Mode' switch which will let you select a vertical block of text. With this you can do what you want in about 15 seconds.

Download the trial and open your file.

Column > Column Mode
Select the text before your values. Hit DEL
Select the text after your values (if there is any) Hit DEL
Select a 1 character column after your data and hit ,
(you will now have a column of your data with commas after)
Column > Column Mode (to exit column mode)
Select your whole document CTRL-A
Format > Convert CR/LFs to Wrap

Good luck =)
posted by TeknoKid at 11:47 AM on May 9, 2012 [3 favorites]


perl -l054 -pe 's/.*VehicleID=(\d+).*/\1/' file1 file2 ...
posted by nicwolff at 11:50 AM on May 9, 2012 [1 favorite]


The secret sauce there is that perl's -l option takes an optional octal character number to separate output lines with instead of the default "\n".
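To see what that octal number means, here is a tiny Python sketch (emit_records is a made-up name): octal 054 is the character code of the comma, and appending it after every record is why the output ends with one extra comma.

```python
separator = chr(0o54)  # octal 054 is decimal 44, the ASCII comma


def emit_records(ids, sep=separator):
    # Like perl -l054: every output record is followed by the separator
    # instead of a newline, so the result ends with a trailing separator.
    return "".join(record + sep for record in ids)


print(emit_records(["5322", "3232"]))  # 5322,3232,
```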
posted by nicwolff at 11:54 AM on May 9, 2012


Another option would be cat filename1 filename2... | cut -c 55-58 | tr -s '\n' ','
Depending on the input format you might have a trailing comma after each file. This would be for the Mac in the terminal.
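The fixed-column approach can be sketched in Python as a plain slice. This assumes, as cut does, that the ID always sits in characters 55 to 58 of the line; cut_columns is an illustrative name.

```python
def cut_columns(line, first=55, last=58):
    # cut -c counts from 1 and includes both ends; Python slices count
    # from 0 and exclude the end, hence first - 1.
    return line[first - 1:last]


example = '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef'
print(cut_columns(example))  # 5322
```

If some lines have a longer or shorter prefix (for instance "&amp;" instead of "&"), the slice lands on the wrong characters, so this only suits files where every line has exactly the same layout.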
posted by EsotericAlgorithm at 12:09 AM on May 10, 2012


OS X has Perl so you're good to go there. Here's a better Perl-Fu (for some flavor of better) solution:

perl -lne '/.*VehicleID=(\d+).*/&&push@I,$1;}{$"=",";print"@I"' *.html
This actually does the requested "present them in one long line with commas: 5322,3232,2323,5232 [and so on...]". E.g.:

$ perl -lne '/.*VehicleID=(\d+).*/&&push@I,$1;}{$"=",";print"@I"' *.html
5322,3232,2323,5232,5322,3232,2323,5232,5322,3232,2323,5232
(copied the OP's HTML (making proper <a> fragments) into {a,b,c}.html, hence the duplication of numbers)

A more code-y and much better solution using the Way Cool Mojolicious / Mojo framework:

use strict;
use Mojo::DOM;
use Mojo::URL;

# slurp in files on command line
my $html = join '', <>;

my $ids =
    # create a DOM
    Mojo::DOM->new($html)
        # find all <a> links and process each
        ->find('a')->map(sub{
                # create a URL from the href attribute
                Mojo::URL->new($_[0]->{href})
                # extract query string
                ->query
                # find the VehicleID param or return nothing
                ->param('VehicleID') || ()
            })
            # squish them together
            ->join(',');
print "$ids\n";

posted by zengargoyle at 4:51 AM on May 10, 2012

