Shell scripting or something better?
May 9, 2008 2:19 PM

Faster techniques for grabbing lines of a file and doing text manipulation and arithmetic via shell (or other) scripting?

1. Are there faster ways to grab lines of a text file?

Currently I use while read line loops and sed -n '#p' filename (where # is a line number) in bash and csh scripts to grab the lines of a file I'm interested in. This seems slow. Are there better (faster) ways to get a given line of a file, or to iterate through specified ranges of lines in a file?
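Concretely, what I'm doing now looks roughly like this (the line numbers and filename are just placeholders):

sed -n '1000p' somefile.txt         # grab line 1000
sed -n '1000,2000p' somefile.txt    # grab lines 1000 through 2000

while read line; do
    echo "$line"                    # ...whatever the real per-line work is...
done < somefile.txt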

2. Are there faster ways than awk to grab values in a line?

Let's say I have the tab-delimited line:

abc 123 345 0.52

What I'd like to do is get the second and third values, or the fourth value, as quickly as possible. Is there a better way than awk? Will a perl or other interpreted language script run faster than a shell script for scraping values from a text file?
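For example, right now I'd do something like this with awk, depending on which fields I need:

printf 'abc\t123\t345\t0.52\n' | awk '{ print $2, $3 }'    # -> 123 345
printf 'abc\t123\t345\t0.52\n' | awk '{ print $4 }'        # -> 0.52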

3. Arithmetic with bash?

I've been doing $((${value1}+${value2})) for integer arithmetic and calc ${value1} / ${value2} for floating point arithmetic within bash. Will I gain a performance benefit from switching over my code from bash to another shell script language, or to another interpreted language entirely?
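In other words, roughly this (the variable values don't matter; calc is the external calculator program I call):

sum=$((${value1}+${value2}))            # integer arithmetic, handled by the shell itself
ratio=$(calc ${value1} / ${value2})     # floating point, via the external calc program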

Thanks for any and all tips and tricks.
posted by Blazecock Pileon to Computers & Internet (20 answers total) 3 users marked this as a favorite
 
I'm no expert in any particular programming language, but if you linked to a sample file, I'd be happy to write a short script to parse it and time it.
posted by wongcorgi at 2:42 PM on May 9, 2008


For #2-type jobs, I generally use "cut". You can define cuts by character position or by field position, and you can specify your separator character, e.g.

cut -d '\i' -f 2,3 <> outfile.out
posted by nomisxid at 3:09 PM on May 9, 2008


Best answer: d'oeth, that example should be

cut -d '\i' -f 2,3 < tabfile.in > outfile.out
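(For what it's worth, tab is cut's default field delimiter, so for a tab-delimited file you shouldn't even need -d:)

cut -f 2,3 < tabfile.in > outfile.out    # fields 2 and 3, split on tabs by default
cut -f 4 < tabfile.in                    # just the fourth field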
posted by nomisxid at 3:10 PM on May 9, 2008


The good thing about shell scripts is that you're already running in the interpreter (so to speak). Perl and other external scripty languages will require a separate interpreter to be run. While Perl may internally be faster at some things, bash/csh/sed are also coded in C and plenty fast on their own, so the differences should be minimal.

For your `calc` lines, keep in mind that backticks and $(...) command substitution both fork off an external process; wherever the shell's built-in $((...)) arithmetic will do the job, use that instead so the shell doesn't have to fork at all.
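A rough illustration of the difference (bc here just stands in for whatever external calculator is being called):

sum=$(( value1 + value2 ))                         # arithmetic expansion: no fork at all
ratio=$(echo "scale=3; $value1 / $value2" | bc)    # command substitution: forks to run bc
ratio=`echo "scale=3; $value1 / $value2" | bc`     # backticks: same fork, older syntax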
posted by rhizome at 3:28 PM on May 9, 2008


Perl really wouldn't be that much faster for any one of these tasks, but you can do all of them together in one perl script. If you do that without calling any external programs, it will be much faster.

I still write the occasional shell script but anything of this complexity is much easier in perl.
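A very rough sketch of what that single do-everything pass might look like (the field positions and the arithmetic are made up, going by the sample line in the question):

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {             # read the file line by line, no external programs
    chomp $line;
    my @f = split /\t/, $line;      # split the tab-delimited fields
    my $sum   = $f[1] + $f[2];      # second and third columns
    my $ratio = $f[1] / $f[3];      # some floating-point arithmetic
    print "$sum\t$ratio\n";
}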
posted by jefftang at 3:29 PM on May 9, 2008


I'm in agreement with jefftang, but I'll add that you never know until you try. If you're looking for an excuse to try something like perl or python, go for it.
posted by PueExMachina at 4:10 PM on May 9, 2008


Response by poster: I'd like to know if I should redirect my coding efforts from bash/csh to those other languages, if there is a concrete and not a hey-you-should-try-it-out-for-the-hell-of-it performance improvement for these specific activities.

I already know enough python to rewrite my scripts, but I'm under a deadline and don't want to commit what will be substantial time to script writing and testing unless there is a definite performance improvement for these tasks from doing so.
posted by Blazecock Pileon at 4:41 PM on May 9, 2008


Perl! It saves all of the forks and startup times, it was made for this stuff, nothing is better unless you want to write C code for your specific endeavor.

$ perldoc perl
$ perldoc perlintro

(Yes I spend my days converting butt-slow shell scripts into Perl.)
posted by zengargoyle at 4:44 PM on May 9, 2008


On lack of preview, yes. Almost any scripting that avoids forks and startup of external programs will be faster than shell scripting. I'm a Perl dude (go figure) because it was around 10 years ago when I needed something, it's mature and sometimes obscure. But even Python would end up being faster (most likely) than longish shell scripting.
posted by zengargoyle at 4:50 PM on May 9, 2008 [1 favorite]


1. Are there faster ways to grab lines of a text file?

It depends how many times you want to do it. If you are repeatedly grabbing lines then you are going to pay a cost firing sed up over and over and over again. If you're going to be doing it a lot you're much better off paying the startup cost for python, reading the entire file into a buffer in a single shot, and doing all of your grabs out of the buffer.
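Something along these lines (an untested sketch; the slice indices are just examples):

import sys

f = open(sys.argv[1])
lines = f.readlines()         # read the whole file into memory in one shot
f.close()

# now grab whatever lines or ranges you need without re-reading the file
first = lines[0]
chunk = lines[100:200]        # lines 101 through 200 (the list is zero-indexed)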

2. Are there faster ways than awk to grab values in a line?

Similar to the above answer -- there is a cost associated with starting up awk. If you only need to do it a single time then awk is fine (although chopping the line up using shell variable slicing would probably be faster as you wouldn't have to start an executable)
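For instance, one way to do that chopping without starting an executable is to let read split the line on the default IFS (which already includes tabs):

line=$'abc\t123\t345\t0.52'                  # bash ANSI-C quoting, just to build a test line
read name second third fourth <<< "$line"    # let the shell do the splitting
echo "$second $third"                        # -> 123 345
echo "$fourth"                               # -> 0.52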

3. Arithmetic with bash?

You won't see much speed difference between arithmetic in bash and any other scripting language.
posted by tkolar at 4:52 PM on May 9, 2008


Best answer: I created a 100,000 line file, where each line contains a random number between 1 and 10,000,000. I used the following scripts to add it up (sorry if the bash isn't optimal, I'm no guru.)


#!/bin/bash
acc=1
while read line; do acc=$((1000000 + $line)); done



#!/usr/bin/env python
import sys
acc = 1
for line in open(sys.argv[1], 'r'):
    acc = 1000000 + int(line)


Here are the results I got:


mayu:~ mark$ time ./addup.sh < rndlines
real 0m12.396s
user 0m9.873s
sys 0m1.752s

mayu:~ mark$ time ./addup.py rndlines
real 0m0.331s
user 0m0.258s
sys 0m0.033s


Go ahead and port your code.
posted by pocams at 4:58 PM on May 9, 2008


I'd expect what's faster to depend on the details of what you're doing, and you'd have to benchmark to be sure. If what you're doing involves lots of invocations of external programs (anything that's not a bash internal), my guess would be that Perl or Python would be faster.

Perl was designed to (among other things) be an alternative to sed and awk in Unix command line pipelines, e.g. for your #2

perl -F"\t" -ane 'print $F[1]+$F[2],"\n"' yourfilename

would print 567 (i.e., 123+456).

But under the deadline situation you mention, I wouldn't recommend switching.
posted by Zed_Lopez at 5:00 PM on May 9, 2008


But you'd probably best ignore the advice of someone who states publicly that 123+456=567. Doh.
posted by Zed_Lopez at 5:02 PM on May 9, 2008


I find Python to be exemplary at these sorts of tasks. See David Beazley’s Generator Tricks For Systems Programmers for some neat tricks. String matching in Python is very high-performance compared to other “scripting” languages, and I’ve always found arithmetic cumbersome in the shell. If you don’t know Python already, you certainly won’t regret learning it.
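A tiny example in the spirit of those generator tricks (the filename and column index are just placeholders):

lines  = open('data.txt')
fields = (line.split('\t') for line in lines)    # lazily split each tab-delimited line
values = (float(f[3]) for f in fields)           # pull out the fourth column
total  = sum(values)                             # one streaming pass over the file
print total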
posted by ijoshua at 5:06 PM on May 9, 2008


(I’m also willing to write a speed test in Python if you can provide data.)
posted by ijoshua at 5:08 PM on May 9, 2008


Best answer: For your question #2, I compared a simple Python script (for line in open()..: print line.split()[2]) with awk:
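(Spelled out, print3.py is essentially:)

#!/usr/bin/env python
import sys
for line in open(sys.argv[1]):
    print line.split()[2]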


mayu:~ mark$ time awk '{ print $3 }' rndtabs > /dev/null
real 0m0.928s
user 0m0.847s
sys 0m0.022s

mayu:~ mark$ time ./print3.py rndtabs > /dev/null
real 0m0.634s
user 0m0.535s
sys 0m0.044s


I ran them a few times in case caching was having an effect, but there was no real change. (Of course, these tests are probably just as fast in Perl, Ruby, or whatever language you like - I'm just comparing vs. shell scripting.)
posted by pocams at 5:10 PM on May 9, 2008


Best answer: Yeah right....

$ time for i in $(seq 1 100);do echo "scale=3;$i/4"| bc; done > /dev/null

real 0m0.738s
user 0m0.280s
sys 0m0.460s

$ time perl -e 'for(1..100){printf "%0.3f\n", $_/4}' > /dev/null

real 0m0.029s
user 0m0.010s
sys 0m0.020s
posted by zengargoyle at 5:10 PM on May 9, 2008


Best answer: pocams' test

$ time while read line; do acc=$((1000000 + $line)); done <>
real 0m6.472s
user 0m5.940s
sys 0m0.470s

$ time perl -Mbignum -e '$acc=1000000+$_' <>
real 0m0.246s
user 0m0.170s
sys 0m0.070s

Scripting of any sort wins!
posted by zengargoyle at 5:43 PM on May 9, 2008


Sorry, bad HTML.. that's < nums.txt at the end there...
posted by zengargoyle at 5:45 PM on May 9, 2008


The perl/python suggestions have already been covered, and yes, it seems like either would be a worthy substitute for sh/csh/bash/etc.

My thought -

How big is this text file? Too big to read it into memory?

The reason I ask is this - your biggest bottleneck isn't the parsing of each line, it's reading the file one line at a time. Yes, there are buffers at several stages, etc., but you'll eventually be hitting the disk for a lot of small reads.

You'll see a huge improvement if you suck the whole thing into memory first, and then operate on it one line at a time from there.
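If you end up staying in bash, one way to do that slurp is something like this (the filename is a placeholder, and whether it actually helps depends on the file and the rest of the loop):

content=$(< bigfile.txt)      # read the entire file into a variable in one shot

while read -r line; do
    echo "$line"              # ...real per-line work goes here...
done <<< "$content"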
posted by swngnmonk at 7:43 AM on May 10, 2008


This thread is closed to new comments.