Comments on: Shell scripting or something better?

Question: Shell scripting or something better?

Blazecock Pileon — Fri, 09 May 2008 14:19:40 -0800

Faster techniques for grabbing lines of a file and doing text manipulation and arithmetic via shell (or other) scripting?

1. Are there faster ways to grab lines of a text file?

Currently I use while read line and sed -n #p filename in bash and csh scripts to grab lines of a file I'm interested in. This seems slow. Are there better (faster) ways to get the line of a file, or to iterate through specified ranges of lines in a file?

2. Are there faster ways than awk to grab values in a line?

Let's say I have the tab-delimited line:

abc 123 345 0.52

What I'd like to do is get the second and third values, or the fourth value, as quickly as possible. Is there a better way than awk? Will a perl or other interpreted language script run faster than a shell script for scraping values from a text file?

3. Arithmetic with bash?

I've been doing $((${value1}+${value2})) for integer arithmetic and calc ${value1} / ${value2} for floating point arithmetic within bash. Will I gain a performance benefit from switching over my code from bash to another shell script language, or to another interpreted language entirely?

Thanks for any and all tips and tricks.

By: wongcorgi

wongcorgi — Fri, 09 May 2008 14:42:10 -0800

I'm no expert in any known programming language, but if you linked to a sample file, I'd be happy to write a short script to parse and time it.

By: nomisxid

nomisxid — Fri, 09 May 2008 15:09:14 -0800

for #2 like jobs, I generally use "cut". You can define cuts by character position, or field-position, as well as defining your separator character, eg.

cut -d '\i' -f 2,3 <> outfile.out

By: nomisxid

nomisxid — Fri, 09 May 2008 15:10:42 -0800

d'oeth, that example should be

cut -d $'\t' -f 2,3 < tabfile.in > outfile.out

By: rhizome

rhizome — Fri, 09 May 2008 15:28:14 -0800

The good thing about shell scripts is that you're already running in the interpreter (so to speak). Perl and other external scripty languages will require a separate interpreter to be run. While Perl may internally be faster at some things, bash/csh/sed are also coded in C and plenty fast on their own, so the differences should be minimal at best.

For your `calc` lines, hopefully you're using $(..) conventions and not backticks so that the shell does not fork in order to do the calc.

By: jefftang

jefftang — Fri, 09 May 2008 15:29:24 -0800

Perl really wouldn't be that much faster for any one of these tasks, but for all of them together, you can do them with one perl script. If you do it without calling any external programs, it will be much faster, especially if you do it all at once.

I still write the occasional shell script but anything of this complexity is much easier in perl.

By: PueExMachina

PueExMachina — Fri, 09 May 2008 16:10:19 -0800

I'm in agreement with jefftang, but I'll add that you never know until you try. If you're looking for an excuse to try something like perl or python, go for it.

By: Blazecock Pileon

Blazecock Pileon — Fri, 09 May 2008 16:41:07 -0800

I'd like to know if I should redirect my coding efforts from bash/csh to those other languages, if there is a concrete and not a hey-you-should-try-it-out-for-the-hell-of-it performance improvement for these specific activities.

I already know enough python to rewrite my scripts, but I'm under a deadline and don't want to commit what will be substantial time to script writing and testing unless there is a definite performance improvement for these tasks from doing so.

By: zengargoyle

zengargoyle — Fri, 09 May 2008 16:44:04 -0800

Perl! It saves all of the forks and startup times, it was made for this stuff, nothing is better unless you want to write C code for your specific endeavor.

$ perldoc perl
$ perldoc perlintro

(Yes I spend my days converting butt-slow shell scripts into Perl.)

By: zengargoyle

zengargoyle — Fri, 09 May 2008 16:50:00 -0800

On lack of preview, yes. Almost any scripting that avoids forks and startup of external programs will be faster than shell scripting. I'm a Perl dude (go figure) because it was around 10 years ago when I needed something, it's mature and sometimes obscure. But even Python would end up being faster (most likely) than longish shell scripting.

By: tkolar

tkolar — Fri, 09 May 2008 16:52:00 -0800

1. Are there faster ways to grab lines of a text file?

It depends how many times you want to do it. If you are repeatedly grabbing lines then you are going to pay a cost firing sed up over and over and over again. If you're going to be doing it a lot you're much better off paying the startup cost for python, reading the entire file into a buffer in a single shot, and doing all of your grabs out of the buffer.

2. Are there faster ways than awk to grab values in a line?

Similar to the above answer -- there is a cost associated with starting up awk. If you only need to do it a single time then awk is fine (although chopping the line up using shell variable slicing would probably be faster, as you wouldn't have to start an executable).

3. Arithmetic with bash?

You won't see much speed difference between arithmetic in bash and any other scripting language.
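
A minimal Python sketch of the read-once approach from answer 1, assuming the file fits in memory. The helper names (load_lines, get_line, get_range) are made up for illustration; line numbers count from 1 so they line up with sed -n 'Np'.

```python
def load_lines(path):
    """Read the whole file once and return its lines without trailing newlines."""
    with open(path, "r") as handle:
        return handle.read().splitlines()


def get_line(lines, number):
    """Return line `number`, counting from 1 like `sed -n 'Np'`."""
    return lines[number - 1]


def get_range(lines, first, last):
    """Return lines first..last inclusive, like `sed -n 'first,lastp'`."""
    return lines[first - 1:last]
```

After lines = load_lines("yourfilename"), every get_line or get_range call is a list lookup in the same process, so no external program is started per grab.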

By: pocams

pocams — Fri, 09 May 2008 16:58:39 -0800

I created a 100,000 line file, where each line contains a random number between 1 and 10,000,000. I used the following scripts to add it up (sorry if the bash isn't optimal, I'm no guru.)



#!/bin/bash
acc=1
while read line; do acc=$((1000000 + $line)); done

#!/usr/bin/env python
import sys
acc = 1
for line in open(sys.argv[1], 'r'):
    acc = 1000000 + int(line)

Here are the results I got:



mayu:~ mark$ time ./addup.sh < rndlines
real 0m12.396s
user 0m9.873s
sys 0m1.752s

mayu:~ mark$ time ./addup.py rndlines
real 0m0.331s
user 0m0.258s
sys 0m0.033s

Go ahead and port your code.

By: Zed_Lopez

Zed_Lopez — Fri, 09 May 2008 17:00:48 -0800

I'd expect what's faster to depend on the details of what you're doing, and you'd have to benchmark to be sure. If what you're doing involves lots of invocations of external programs (anything that's not a bash internal), my guess would be that Perl or Python would be faster.

Perl was designed to (among other things) be an alternative to sed and awk in Unix command line pipelines, e.g. for your #2

perl -F"\t" -ane 'print $F[1]+$F[2],"\n"' yourfilename

would print 567 (i.e., 123+456).

But under the deadline situation you mention, I wouldn't recommend switching.

By: Zed_Lopez

Zed_Lopez — Fri, 09 May 2008 17:02:32 -0800

But you'd probably best ignore the advice of someone who states publicly that 123+456=567. Doh.
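
For comparison, a rough Python counterpart to the Perl one-liner, applied to the sample line from the question. It assumes the fields are separated by single tabs and that fields two and three are integers; the function names are illustrative only.

```python
def parse_fields(line):
    """Split one tab-delimited line into a list of string fields."""
    return line.rstrip("\n").split("\t")


def sum_second_and_third(line):
    """Add the second and third fields as integers (list indexes 1 and 2)."""
    fields = parse_fields(line)
    return int(fields[1]) + int(fields[2])


sample = "abc\t123\t345\t0.52\n"  # the example line from the question
print(sum_second_and_third(sample))    # 123 + 345 = 468
print(float(parse_fields(sample)[3]))  # fourth field as a float: 0.52
```

Splitting once and indexing into the list gives all four values from a single pass over the line, without starting awk or cut for each one.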

By: ijoshua

ijoshua — Fri, 09 May 2008 17:06:55 -0800

I find Python to be exemplary at these sorts of tasks. See David Beazley's Generator Tricks For Systems Programmers for some neat tricks. String matching in Python is very high-performance compared to other "scripting" languages, and I've always found arithmetic cumbersome in the shell. If you don't know Python already, you certainly won't regret learning it.

By: ijoshua

ijoshua — Fri, 09 May 2008 17:08:21 -0800

(I'm also willing to write a speed test in Python if you can provide data.)

By: pocams

pocams — Fri, 09 May 2008 17:10:11 -0800

For your question #2, I compared a simple Python script (for line in open()..: print line.split()[2]) with awk:



mayu:~ mark$ time awk '{ print $3 }' rndtabs > /dev/null
real 0m0.928s
user 0m0.847s
sys 0m0.022s

mayu:~ mark$ time ./print3.py rndtabs > /dev/null
real 0m0.634s
user 0m0.535s
sys 0m0.044s

I ran them a few times in case caching was having an effect, but there was no real change. (Of course, these tests are probably just as fast in Perl, Ruby, or whatever language you like - I'm just comparing vs. shell scripting.)

By: zengargoyle

zengargoyle — Fri, 09 May 2008 17:10:26 -0800

Yeah right....

$ time for i in $(seq 1 100);do echo "scale=3;$i/4"| bc; done > /dev/null

real 0m0.738s
user 0m0.280s
sys 0m0.460s

$ time perl -e 'for(1..100){printf "%0.3f\n", $_/4}' > /dev/null

real 0m0.029s
user 0m0.010s
sys 0m0.020s
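
The same in-process idea sketched in Python, for anyone porting from calc or bc. This is only an illustration of doing the floating point division inside one interpreter; the function name is made up, and the formatting follows the Perl printf above rather than bc's output style.

```python
def quarters(limit):
    """Return i/4 for i in 1..limit, each formatted to three decimal places."""
    return ["%0.3f" % (i / 4.0) for i in range(1, limit + 1)]


# Show the first four values; quarters(100) covers the loop in the shell example.
print("\n".join(quarters(4)))
```

All the divisions happen in a single process, which is where the Perl version gets its advantage over starting bc once per value.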

By: zengargoyle

zengargoyle — Fri, 09 May 2008 17:43:54 -0800

pocams' test

$ time while read line; do acc=$((1000000 + $line)); done <>
real 0m6.472s
user 0m5.940s
sys 0m0.470s

$ time perl -Mbignum -e '$acc=1000000+$_' <>
real 0m0.246s
user 0m0.170s
sys 0m0.070s

Scripting of any sort wins!

By: zengargoyle

zengargoyle — Fri, 09 May 2008 17:45:34 -0800

Sorry, bad HTML.. that's < nums.txt at the end there...

By: swngnmonk

swngnmonk — Sat, 10 May 2008 07:43:30 -0800

All the perl/python solutions have already come out, and yes, it seems like either would be a worthy substitute for sh/csh/bash/etc.

My thought -

How big is this text file? Too big to read it into memory?

The reason I ask is this - your biggest bottleneck isn't the parsing of each line - it's the reading of the file one line at a time. Yes, there are buffers at several stages, etc, but you'll be eventually hitting the disk for a lot of small reads.

You'll see a huge improvement if you suck the whole thing into memory first, and then operate on it one line at a time from there.
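
One way to check this on your own data rather than take it on faith: a small Python 3 sketch that does the same work both ways and times each. It assumes one integer per line, as in pocams' test file; the function names are hypothetical. Python's file objects already buffer their reads, so the gap may turn out smaller than expected.

```python
import time


def sum_streaming(path):
    """Add up one integer per line, reading the file line by line."""
    total = 0
    with open(path, "r") as handle:
        for line in handle:
            total += int(line)
    return total


def sum_in_memory(path):
    """Add up one integer per line after reading the whole file at once."""
    with open(path, "r") as handle:
        text = handle.read()
    return sum(int(line) for line in text.splitlines())


def timed(func, path):
    """Return (result, elapsed seconds) for a single call of func(path)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start
```

Calling timed(sum_streaming, "rndlines") and timed(sum_in_memory, "rndlines") a few times each gives comparable numbers; the two results should match, and only the elapsed times differ.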