Shell scripting or something better?
May 9, 2008 2:19 PM
Faster techniques for grabbing lines of a file and doing text manipulation and arithmetic via shell (or other) scripting?
1. Are there faster ways to grab lines of a text file?
Currently I use while read line and sed -n #p filename in bash and csh scripts to grab lines of a file I'm interested in. This seems slow. Are there better (faster) ways to get the line of a file, or to iterate through specified ranges of lines in a file?
2. Are there faster ways than awk to grab values in a line?
Let's say I have the tab-delimited line:
abc 123 345 0.52
What I'd like to do is get the second and third values, or the fourth value, as quickly as possible. Is there a better way than awk? Will a perl or other interpreted language script run faster than a shell script for scraping values from a text file?
3. Arithmetic with bash?
I've been doing $((${value1}+${value2})) for integer arithmetic and calc ${value1} / ${value2} for floating point arithmetic within bash. Will I gain a performance benefit from switching over my code from bash to another shell script language, or to another interpreted language entirely?
Thanks for any and all tips and tricks.
For #2-like jobs, I generally use "cut". You can define cuts by character position or by field position, as well as defining your separator character, e.g.
cut -d '\i' -f 2,3 <> outfile.out>
posted by nomisxid at 3:09 PM on May 9, 2008
Best answer: d'oeth, that example should be
cut -d '\i' -f 2,3 < tabfile.in > outfile.out
posted by nomisxid at 3:10 PM on May 9, 2008
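As a side note, cut's default field delimiter is already a tab, so for the sample line from the question the -d flag can be dropped entirely. A minimal sketch, for illustration only:
$ printf 'abc\t123\t345\t0.52\n' | cut -f 2,3    # prints the 2nd and 3rd fields, tab-separated
$ printf 'abc\t123\t345\t0.52\n' | cut -f 4      # prints 0.52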
The good thing about shell scripts is that you're already running in the interpreter (so to speak). Perl and other external scripting languages require a separate interpreter to be started. While Perl may internally be faster at some things, bash/csh/sed are also coded in C and plenty fast on their own, so in practice the differences should be minimal.
For your `calc` lines, hopefully you're using $(..) conventions and not backticks so that the shell does not fork in order to do the calc.
posted by rhizome at 3:28 PM on May 9, 2008
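For clarity, arithmetic expansion with $((..)) runs entirely inside the current shell, while command substitution, whether written $(..) or with backticks, always forks a child process. A sketch, with bc standing in for the calc call purely as an assumption:
sum=$(( value1 + value2 ))                        # arithmetic expansion: no fork
quot=$(echo "scale=3; $value1 / $value2" | bc)    # $(..) command substitution: forks
quot=`echo "scale=3; $value1 / $value2" | bc`     # backticks: same behavior, older syntax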
Perl really wouldn't be that much faster for any one of these tasks, but for all of them together, you can do them with one perl script. If you do it without calling any external programs, it will be much faster, especially if you do it all at once.
I still write the occasional shell script but anything of this complexity is much easier in perl.
posted by jefftang at 3:29 PM on May 9, 2008
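As a sketch of what "all of them together" in a single perl process might look like, assuming a tab-delimited data.txt and a target line 42, both hypothetical:
$ perl -F'\t' -lane 'print $F[1] + $F[2] if $. == 42' data.txt    # grab line 42, add fields 2 and 3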
I'm in agreement with jefftang, but I'll add that you never know until you try. If you're looking for an excuse to try something like perl or python, go for it.
posted by PueExMachina at 4:10 PM on May 9, 2008
Response by poster: I'd like to know if I should redirect my coding efforts from bash/csh to those other languages, if there is a concrete and not a hey-you-should-try-it-out-for-the-hell-of-it performance improvement for these specific activities.
I already know enough python to rewrite my scripts, but I'm under a deadline and don't want to commit what will be substantial time to script writing and testing unless there is a definite performance improvement for these tasks from doing so.
posted by Blazecock Pileon at 4:41 PM on May 9, 2008
Perl! It saves all of the forks and startup times, it was made for this stuff, nothing is better unless you want to write C code for your specific endeavor.
$ perldoc perl
$ perldoc perlintro
(Yes I spend my days converting butt-slow shell scripts into Perl.)
posted by zengargoyle at 4:44 PM on May 9, 2008
On lack of preview, yes. Almost any scripting that avoids forks and startup of external programs will be faster than shell scripting. I'm a Perl dude (go figure) because it was around 10 years ago when I needed something, it's mature and sometimes obscure. But even Python would end up being faster (most likely) than longish shell scripting.
posted by zengargoyle at 4:50 PM on May 9, 2008 [1 favorite]
1. Are there faster ways to grab lines of a text file?
It depends how many times you want to do it. If you are repeatedly grabbing lines then you are going to pay a cost firing sed up over and over and over again. If you're going to be doing it a lot you're much better off paying the startup cost for python, reading the entire file into a buffer in a single shot, and doing all of your grabs out of the buffer.
2. Are there faster ways than awk to grab values in a line?
Similar to the above answer -- there is a cost associated with starting up awk. If you only need to do it a single time then awk is fine (although chopping the line up using shell variable slicing would probably be faster as you wouldn't have to start an executable)
3. Arithmetic with bash?
You won't see much speed difference between arithmetic in bash and any other scripting language.
posted by tkolar at 4:52 PM on May 9, 2008
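A sketch of the "shell variable slicing" idea, assuming bash and the sample line from the question; the split happens inside the shell, with no awk or cut process started:
line=$'abc\t123\t345\t0.52'
IFS=$'\t' read -r name v1 v2 v3 <<< "$line"    # split on tabs inside the shell
echo "$v1 $v2"                                 # 123 345
echo "$v3"                                     # 0.52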
Best answer: I created a 100,000 line file, where each line contains a random number between 1 and 10,000,000. I used the following scripts to add it up (sorry if the bash isn't optimal, I'm no guru.)
#!/bin/bash
acc=1
while read line; do acc=$((1000000 + $line)); done
#!/usr/bin/env python
import sys
acc = 1
for line in open(sys.argv[1], 'r'):
    acc = 1000000 + int(line)
Here's the results I got:
mayu:~ mark$ time ./addup.sh < rndlines
real 0m12.396s
user 0m9.873s
sys 0m1.752s
mayu:~ mark$ time ./addup.py rndlines
real 0m0.331s
user 0m0.258s
sys 0m0.033s
Go ahead and port your code.
posted by pocams at 4:58 PM on May 9, 2008
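For anyone reproducing the numbers, a file like rndlines can be generated in a single shot. This is a guess at the setup, not pocams' actual command:
$ awk 'BEGIN { srand(); for (i = 0; i < 100000; i++) print int(rand() * 10000000) + 1 }' > rndlines    # guessed test-data generator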
I'd expect what's faster to depend on the details of what you're doing, and you'd have to benchmark to be sure. If what you're doing involves lots of invocations of external programs (anything that's not a bash internal), my guess would be that Perl or Python would be faster.
Perl was designed to (among other things) be an alternative to sed and awk in Unix command line pipelines, e.g. for your #2
perl -F"\t" -ane 'print $F[1]+$F[2],"\n"' yourfilename
would print 567 (i.e., 123+456).
But under the deadline situation you mention, I wouldn't recommend switching.
posted by Zed_Lopez at 5:00 PM on May 9, 2008
But you'd probably best ignore the advice of someone who states publicly that 123+456=567. Doh.
posted by Zed_Lopez at 5:02 PM on May 9, 2008
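The same autosplit switches cover the other extractions from the question as well, reusing the yourfilename placeholder:
$ perl -F"\t" -lane 'print "$F[1]\t$F[2]"' yourfilename    # second and third fields
$ perl -F"\t" -lane 'print $F[3]' yourfilename             # fourth field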
I find Python to be exemplary at these sorts of tasks. See David Beazley’s Generator Tricks For Systems Programmers for some neat tricks. String matching in Python is very high-performance compared to other “scripting” languages, and I’ve always found arithmetic cumbersome in the shell. If you don’t know Python already, you certainly won’t regret learning it.
posted by ijoshua at 5:06 PM on May 9, 2008
(I’m also willing to write a speed test in Python if you can provide data.)
posted by ijoshua at 5:08 PM on May 9, 2008
Best answer: For your question #2, I compared a simple Python script (for line in open()..: print line.split()[2]) with awk:
mayu:~ mark$ time awk '{ print $3 }' rndtabs > /dev/null
real 0m0.928s
user 0m0.847s
sys 0m0.022s
mayu:~ mark$ time ./print3.py rndtabs > /dev/null
real 0m0.634s
user 0m0.535s
sys 0m0.044s
I ran them a few times in case caching was having an effect, but there was no real change. (Of course, these tests are probably just as fast in Perl, Ruby, or whatever language you like - I'm just comparing vs. shell scripting.)
posted by pocams at 5:10 PM on May 9, 2008
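The print3.py being timed is only described inline above; from that description its logic presumably amounts to a one-liner like this (Python 2 print syntax, reconstructed as a guess):
$ python -c "for line in open('rndtabs'): print line.split()[2]" > /dev/null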
Best answer: Yeah right....
$ time for i in $(seq 1 100);do echo "scale=3;$i/4"| bc; done > /dev/null
real 0m0.738s
user 0m0.280s
sys 0m0.460s
$ time perl -e 'for(1..100){printf "%0.3f\n", $_/4}' > /dev/null
real 0m0.029s
user 0m0.010s
sys 0m0.020s
posted by zengargoyle at 5:10 PM on May 9, 2008
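The gap above is mostly the hundred bc forks rather than the division itself; keeping one process alive for the whole loop closes most of it. A sketch using awk, as an illustration rather than a timing from the thread:
$ time seq 1 100 | awk '{ printf "%0.3f\n", $1 / 4 }' > /dev/null    # one awk process for all 100 divisions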
Best answer: pocams' test
$ time while read line; do acc=$((1000000 + $line)); done <>
real 0m6.472s
user 0m5.940s
sys 0m0.470s
$ time perl -Mbignum -e '$acc=1000000+$_' <>
real 0m0.246s
user 0m0.170s
sys 0m0.070s
Scripting of any sort wins!
posted by zengargoyle at 5:43 PM on May 9, 2008
Sorry, bad HTML.. that's < nums.txt at the end there...
posted by zengargoyle at 5:45 PM on May 9, 2008
All the perl/python solutions have already come out, and yes, it seems like either would be a worthy substitute for sh/csh/bash/etc..
My thought -
How big is this text file? Too big to read it into memory?
The reason I ask is this - your biggest bottleneck isn't the parsing of each line - it's the reading of the file one line at a time. Yes, there are buffers at several stages, etc, but you'll be eventually hitting the disk for a lot of small reads.
You'll see a huge improvement if you suck the whole thing into memory first, and then operate on it one line at a time from there.
posted by swngnmonk at 7:43 AM on May 10, 2008
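A sketch of the slurp-then-iterate idea in bash, assuming a hypothetical bigfile.txt that fits comfortably in memory:
contents=$(< bigfile.txt)        # one big read; $(< file) avoids even a cat fork
while IFS= read -r line; do
    :                            # operate on "$line" here
done <<< "$contents"             # iterate over the in-memory copy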