How to parse a text file & collect a piece of each line automatically?
May 9, 2012 10:18 AM
I need to go through text files, each containing hundreds of lines, and copy a specific part of each line, and then add all those parts together into one long line, each part separated by a comma. But how?
Basically, each file has hundreds of lines similar to these examples:
<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef
<a href="index.php?PageName=article_browser&VehicleID=3232&nodeInfo=ghjdsh
<a href="index.php?PageName=article_browser&VehicleID=2323&nodeInfo=dsdfdd
<a href="index.php?PageName=article_browser&VehicleID=5232&nodeInfo=dsdfds
What I need to do is get the numbers AFTER VehicleID= and before the following &, and then present them in one long line with commas:
5322,3232,2323,5232 [and so on...]
I have a PC and a Mac at my disposal so either is fine for this. I was going to create an Automator workflow that would find & replace the beginning of each sentence (up to VehicleID=) with a blank/comma, and then find & replace again starting at &.... Unfortunately this doesn't work because the rest of the line is always unique.
I haven't done any programming since Turbo Pascal. I looked at Xcode but it would take longer to learn how to do this with Xcode than it would to go through every line in a text processor.
This seems like such a simple thing, though, so maybe someone can think of something I could try.
Can you dump it in a text file? Without seeing an entire file it's tough for me to visualise/explain.
Do a search for <a href="index.php?PageName=article_browser&VehicleID= and replace with a blank space.
If the vehicle ID is ALWAYS 4 characters long, I'd probably then just paste a comma to manually replace everything to the right of the number.
Alternatively, import into Excel and set the "delimiting" by hand, so before the number is one 'column', the vehicle ID is a second 'column', and the other irrelevant stuff is a third column.
Open, nuke the first and third columns, resave as CSV, then use a find and replace in a text editor to replace the hard return with a comma.
posted by tilde at 10:28 AM on May 9, 2012
This is what Terminal on your Mac is for:
tr '&' '\012' < file.txt | grep VehicleID | sed 's/VehicleID=//' | tr '\012' ',' > newfile.txt
The above adds a trailing comma to the file which you might not need - would that be a problem?
posted by scruss at 10:28 AM on May 9, 2012 [1 favorite]
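If the trailing comma is a problem, one way around it is to collect the IDs first and join them afterwards. Here is a minimal Python sketch of the same idea as the pipeline above, assuming the input looks like the sample lines in the question. The function name `vehicle_ids` is made up for illustration.

```python
import re


def vehicle_ids(text):
    """Return every run of digits that directly follows 'VehicleID='."""
    return re.findall(r"VehicleID=(\d+)", text)


sample = (
    '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef\n'
    '<a href="index.php?PageName=article_browser&VehicleID=3232&nodeInfo=ghjdsh\n'
)

# join() puts commas only between items, so there is no trailing comma.
print(",".join(vehicle_ids(sample)))
```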
Macs have Unix shell tools, right? Then something like this should do the trick:
sed -e 's,^.*VehicleID=\([0-9]*\)&.*$,\1,' FILENAMES |tr '\n' ','
(Not tested.)
posted by stebulus at 10:29 AM on May 9, 2012
Got perl installed?
perl -i.bak -pe 's/.*VehicleID=(\d+)&.*/$1/;' FILENAME1.html FILENAME2.html ...
This will edit the files in-place and create a bunch of FILENAME.bak copies of the originals which you can safely delete.
posted by jozxyqk at 10:40 AM on May 9, 2012
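For anyone without Perl, roughly the same edit-in-place-with-backup behaviour can be sketched in Python with the standard `fileinput` module. This is only an approximation of the one-liner above, not a translation of it. The function name `keep_only_ids` is hypothetical, and it assumes a non-empty list of file paths is passed in.

```python
import fileinput
import re


def keep_only_ids(paths):
    """Rewrite each file in place so matching lines hold only the VehicleID.

    A copy of each original is kept as NAME.bak. Lines without
    'VehicleID=<digits>&' are written back unchanged.
    """
    with fileinput.input(files=paths, inplace=True, backup=".bak") as lines:
        for line in lines:
            match = re.search(r"VehicleID=(\d+)&", line)
            # With inplace=True, print() writes into the file being edited.
            print(match.group(1) if match else line.rstrip("\n"))
```

Note that this still leaves one ID per line, so the lines would need joining with commas afterwards.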
What you're looking for are regular expressions, also known as regex. There's a lot of software that offers regex support, and a lot of code libraries that offer it as well if you want something more adaptable.
In your case, you can probably trim the string using the Automator workflow you were describing, and then doing a find-replace for the following pattern should trim off the nodeInfo after the ID value:
&nodeInfo=\w*.*
posted by mikurski at 10:41 AM on May 9, 2012
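To make the two find-and-replace steps concrete, here is a small Python sketch using the standard `re` module. It assumes each line has the shape shown in the question. The second pattern is written as `&nodeInfo=.*`, which should match the same text as `&nodeInfo=\w*.*`, since the trailing `.*` already swallows the rest of the line.

```python
import re

line = '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef'

# Step 1: drop everything up to and including "VehicleID=".
trimmed = re.sub(r"^.*VehicleID=", "", line)
# Step 2: drop "&nodeInfo=" and whatever follows it.
trimmed = re.sub(r"&nodeInfo=.*", "", trimmed)

print(trimmed)  # 5322
```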
Python:
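from fileinput import input
from re import findall

# Collect the VehicleID numbers from every input file; lines without one are skipped.
ids = [m for line in input() for m in findall(r"(?<=VehicleID=)\d+", line)]
with open("outfile.txt", "w") as out:
    out.write(",".join(ids))

Then just invoke the script (saved here as extract.py), passing the files you want to handle on the command line: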
python extract.py path/to/files/*
posted by kindall at 11:07 AM on May 9, 2012
I'm not a programmer, so this might not be the most elegant method, but I'd do this:
1. Copy the text file into Excel.
2. Run a Text to Columns using "&" as the delimiter. This should give you "VehicleID=5322" in column B.
3. Delete all columns other than B.
4. In another column use =mid(B1,11,4) Use a higher number for that last value if they aren't all 4 digits.
5. Copy that all the way down.
6. Copy those values to a Word doc.
7. Do a find and replace, replacing the paragraph symbol with a comma.
Done.
posted by etc. at 11:47 AM on May 9, 2012 [2 favorites]
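The spreadsheet steps above map onto a few lines of Python, which may help show what each step does. This is a sketch only: it assumes `VehicleID=...` is always the second `&`-separated field, as in the sample lines, and the helper name `id_from_line` is invented. The formula starts at character 11 because the label `VehicleID=` is 10 characters long.

```python
def id_from_line(line):
    """Mimic Text to Columns on "&" followed by MID(B1, 11, ...)."""
    columns = line.split("&")            # step 2: split on the delimiter
    column_b = columns[1]                # e.g. "VehicleID=5322"
    return column_b[len("VehicleID="):]  # step 4: skip the 10-character label


lines = [
    '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef',
    '<a href="index.php?PageName=article_browser&VehicleID=3232&nodeInfo=ghjdsh',
]

# Steps 6-7: put the values on one line, separated by commas.
print(",".join(id_from_line(l) for l in lines))
```

Slicing to the end of the field, rather than taking exactly 4 characters, sidesteps the case where the IDs are not all 4 digits.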
There is a text editor called UltraEdit ( http://www.ultraedit.com/ ). They have a free trial available for PC and Mac.
They have a 'Column Mode' switch which will let you select a vertical block of text. With this you can do what you want in about 15 seconds.
Download the trial and open your file.
Column > Column Mode
Select the text before your values. Hit DEL
Select the text after your values (if there is any) Hit DEL
Select a 1 character column after your data and hit ,
(you will now have a column of your data with commas after)
Column > Column Mode (to exit column mode)
Select your whole document CTRL-A
Format > Convert CR/LFs to Wrap
Good luck =)
posted by TeknoKid at 11:47 AM on May 9, 2012 [3 favorites]
perl -l054 -pe 's/.*VehicleID=(\d+).*/\1/' file1 file2 ...
posted by nicwolff at 11:50 AM on May 9, 2012 [1 favorite]
The secret sauce there is that perl's -l option takes an optional octal character number to separate output lines with instead of the default "\n".
posted by nicwolff at 11:54 AM on May 9, 2012
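For readers unfamiliar with octal character codes: 054 in octal is 44 in decimal, the ASCII code for a comma. A tiny Python illustration of the same "end each output record with a comma instead of a newline" idea, with made-up sample IDs:

```python
separator = chr(0o54)  # octal 054 is the ASCII code for ","
ids = ["5322", "3232", "2323"]

# Like perl -l054, end every output record with the separator instead of "\n".
for vehicle_id in ids:
    print(vehicle_id, end=separator)
```

As with the Perl version, this leaves a comma after the last ID.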
Another option would be:
cat filename1 filename2... | cut -c 55-58 | tr -s '\n' ','
Depending on the input format you might have a trailing comma after each file. This would be for the Mac in the terminal.
posted by EsotericAlgorithm at 12:09 AM on May 10, 2012
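The `cut -c 55-58` approach relies on fixed character positions: the text before the ID in the sample lines is 54 characters long, so a 4-digit ID occupies columns 55 to 58. A Python sketch of the same slice follows. The helper name `cut_columns` is hypothetical, and it only works if every line has exactly this prefix and a 4-digit ID.

```python
def cut_columns(line, first=55, last=58):
    """Return characters first..last of a line, counting from 1 like cut -c."""
    return line[first - 1:last]


sample = '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef'
print(cut_columns(sample))
```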
OS X has Perl so you're good to go there. Here's a better Perl-Fu (for some flavor of better) solution:
perl -lne '/.*VehicleID=(\d+).*/&&push@I,$1;}{$"=",";print"@I"' *.html
This actually does the requested "present them in one long line with commas: 5322,3232,2323,5232 [and so on...]". E.g.:
$ perl -lne '/.*VehicleID=(\d+).*/&&push@I,$1;}{$"=",";print"@I"' *.html
5322,3232,2323,5232,5322,3232,2323,5232,5322,3232,2323,5232
(I copied the OP's HTML, making proper <a> fragments, into {a,b,c}.html, hence the duplication of numbers.)
A more code-y and much better solution using the Way Cool Mojolicious / Mojo framework:
use strict;
use Mojo;

# slurp in files on command line
my $html = join '', <>;

my $ids =
  # create a DOM
  Mojo::DOM->new($html)
  # find <a> links and process each
  ->a->map(sub{
    # create a URL from the href attribute
    Mojo::URL->new($_[0]->{href})
    # extract query string
    ->query
    # find the VehicleID param or return nothing
    ->param('VehicleID') || ()
  })
  # squish them together
  ->join(',');

print "$ids\n";
posted by zengargoyle at 4:51 AM on May 10, 2012
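The same parse-the-HTML-then-parse-the-URL idea can be sketched with only the Python standard library, for anyone who would rather not install a framework. This is an illustration, not a port of the Mojolicious code. The class name `VehicleIdCollector` is invented, and like the version above it assumes the input contains properly closed <a> tags rather than the truncated fragments shown in the question.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class VehicleIdCollector(HTMLParser):
    """Collect the VehicleID query parameter from every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Parse the query string properly instead of pattern-matching on it.
        query = parse_qs(urlsplit(href).query)
        # Links without a VehicleID parameter contribute nothing.
        self.ids.extend(query.get("VehicleID", []))


html = (
    '<a href="index.php?PageName=article_browser&VehicleID=5322&nodeInfo=abcdef">a</a>\n'
    '<a href="index.php?PageName=article_browser&VehicleID=3232&nodeInfo=ghjdsh">b</a>\n'
    '<a href="about.html">no id</a>\n'
)

collector = VehicleIdCollector()
collector.feed(html)
print(",".join(collector.ids))
```

Parsing the markup and the URL separately means the order of the query parameters no longer matters, which the regex and fixed-column approaches both depend on.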
This thread is closed to new comments.
posted by sarahnicolesays at 10:22 AM on May 9, 2012