Shell script handling CSV columns
June 10, 2009 4:30 PM   Subscribe

I'm using a shell script to handle some excel docs by converting to csv and reformatting it a little bit. I'm having some difficulty going on with the next part of the script. Some help would be great. :)

ls *.xls
echo "Enter filename for excel input file: "
read filen
xls2csv -x "$filen"".xls" -b WINDOWS-1252 -c "$filen"".csv" -a UTF-8
sed '1,3d' "$filen".csv >"$filen"_temp.csv
cut -d "," -f 1,4- "$filen"_temp.csv > "$filen"_temp2.csv
awk -F, '{sub($NF, "");print}' "$filen"_temp2.csv > "$filen"_temp3.csv
sed 's/.$//' "$filen"_temp3.csv > "$filen"_fixed.csv
rm "$filen".csv
rm "$filen"_temp.csv
rm "$filen"_temp2.csv
rm "$filen"_temp3.csv
echo "$filen"".xls corrected and saved as ""$filen""_fixed.csv"

EXAMPLE OUTPUT (first three lines only):
"Device ID","1) S31 Which best describes how you answered the online reading comprehension quiz?","2) S32 Which best describes how you answered the online timed retrieval quiz?","3) B19. If you want your product to be easy to find in the supermarket then you should make its container","4) C19. So that he can shift attention between the radio and his incessantly talking girl friend when she is in the car, Joe adjusts his radio","5) B20. Early selection is most likely to occur for","6) C20. Early selection for a red target is most likely to occur when there is","7) B21. In a lexical decision task, when the target is a bird name, e.g. robin, it is usually preceded by the prime BODY but is sometimes preceded by the prime BIRD.","8) C21. In a lexical decision task, when the target is a dog name, e.g. collie, it is usually preceded by the prime CAR but is sometimes preceded by the prime DOG.","9) B23. Suppose that that you see a brief display with 12 colored letters: 4 red, 4 white, and 4 blue. At the offset of the display you hear tone. A tone instructs you to report only the letters of a particular color: high for red, medium for white, and low for blue. About how many letters do you report?","10) B22. Sperling (1960) found that partial report produced the highest estimate of the number of available letters when the tone occurred ","11) C23. According to the logic of Sperling’s (1960) partial report method, an observer who reports three letters from a row in a 4 x 4 display that was cued at the display’s offset must have seen at least","12) C22. Sperling (1960) found that the greatest difference between full and partial report in the number available of letters was when the tone occurred ____ milliseconds after the offset of the visual display"
96A39,6,4,"4 c","4 c","5 c","5 c","5 c","4 i","3 c","1 c","1 i","5 i"
1E90A4,5,3,"4 c","4 c","5 c","5 c","2 i","5 c","3 c","1 c","4 i","4 i"

First, I need to remove the numbering of the questions from the first line. So "1) S31 Which best describes how you answered the online reading comprehension quiz?" would become "S31 Which best describes how you answered the online reading comprehension quiz?"

Next I need to remove an entire column if its row header doesn't begin with either B or C or is the "Device ID" column. For example, column starting with B19 is kept, the C23 is kept, and"Device ID" column is kept. The column with the question starting S31 must be entirely removed.

posted by fightoplankton to Computers & Internet (11 answers total) 3 users marked this as a favorite
No help with your script, but it seems this would be easier with a scripting language like Perl or Python which has libraries built to handle CSV and even Excel.
posted by wongcorgi at 4:33 PM on June 10, 2009

This kind of stuff is really easy in Perl or similarly-featured languages (Python, Ruby, etc.). And I believe shell scripts are the only things harder to read than Perl. Why are you using a shell script?
posted by mohrr at 4:33 PM on June 10, 2009

I too would do this in Python because quoted CSV handling is much better. One comma in between the quotes and awk is toast. The Python csv module is awesome and has saved my butt lots. There are also direct XLS libraries for Python to make it even easier. here's a blog post on the xlrd library:
    import xlrd
    book = xlrd.open_workbook("myfile.xls")
    sh = book.sheet_by_index(0)
    for r in range(sh.nrows)[1:]:
    print sh.row(r)[:4]
So sweet.
sudo easy_install-2.5 xlrd
installs it.
Everyone should learn Python.
posted by GuyZero at 5:04 PM on June 10, 2009 [1 favorite]

GuyZero: Everyone should learn Python.
Sorry for the derail, but... Okay, how? What's the best/easiest way to learn functional python?
posted by coriolisdave at 5:51 PM on June 10, 2009

A copy of 'Python in a Nutshell' and just digging in. The online docs are really good and the language basics are pretty, well, basic once you get over the issue of semantic whitespace - indentation matters for those who don't know the language. It really is a great scripting solution for situations like this.
posted by GuyZero at 5:56 PM on June 10, 2009

I also recommend Python. The standard library comes with an excellent CSV library.
posted by demiurge at 6:53 PM on June 10, 2009

seems like you're getting close with sed and awk. These newfangled Python and Perl whippersnappers might be right that their language can do this stuff well*, but so can old fashioned awk if that's really what you know best.

The print command is fairly flexible and lets you selectively output individual columns. Nest it within an "if" and you're really on to something:

awk -F"," 'for (x=1; x0 {print $x} }'

You'll want some nested ifs in there to control which columns you're dealing with and what substrings to accept/reject. But done right, you should be able to do everything in one carefully written script.

* now get off my lawn!

posted by drmarcj at 7:18 PM on June 10, 2009

Response by poster: It honestly should have been written in perl since i am already using a perl script in the shell script. Also, I forgot to make this compatible for windows. *SIGH* Looks like I'm starting over
posted by fightoplankton at 8:27 AM on June 11, 2009

Best answer: The idea of using shell script to process CSV files is just perverse.

I love it.

Here's an exceedingly cheap and nasty CSV library written in bash, and here's a stub script that exercises it (pipe your CSV data into that to see what it does). If nobody has picked up this ball and run with it by the time I wake up again, I'll whip up something that uses these to do the header editing and column deletion you want.
posted by flabdablet at 8:57 AM on June 11, 2009

Response by poster: I've tried doing this in ruby. I like it much better but I can't get as far in my task, and have yet to figure out how to delete columns. Help would be cool.

require 'rubygems'
require 'roo'
require 'csv'
require 'fileutils'

FileUtils.mkdir_p "/Users/pshapiro/Desktop/Excel/xls"
FileUtils.mkdir_p "/Users/pshapiro/Desktop/Excel/tmp"
FileUtils.mkdir_p "/Users/pshapiro/Desktop/Excel/csv"

@filesxls = Dir["/Users/pshapiro/Desktop/Excel/*.xls"]
for file in @filesxls

@filesxls = Dir["/Users/pshapiro/Desktop/Excel/xls/*.xls"]
@filetmp = Dir["/Users/pshapiro/Desktop/Excel/xls/*.xls_tmp"]

for file in @filesxls
convert =
convert.default_sheet = convert.sheets[0]

@filestmp = Dir["/Users/pshapiro/Desktop/Excel/xls/*.xls_tmp"]

for file in @filestmp

dir = "/Users/pshapiro/Desktop/Excel/tmp/"
files = Dir.entries(dir)
files.each do |f|
next if f == "." or f == ".."
oldFile = dir + "/" + f
newFile = dir + "/" + File.basename(f, '.*')
File.rename(oldFile, newFile)

files = Dir.entries(dir)
files.each do |f|
next if f == "." or f == ".."
oldFile = dir + "/" + f
newFile = dir + "/" + f + ".csv"
File.rename(oldFile, newFile)

@filescsv = Dir["/Users/pshapiro/Desktop/Excel/tmp/*.csv"]

for file in @filescsv


@filescsv = Dir["/Users/pshapiro/Desktop/Excel/csv/*.csv"]

for file in @filescsv
5.times {
text="","r"){|f|f.gets;},"w+"){|f| f.write(text)}
posted by fightoplankton at 9:41 AM on June 14, 2009

As a shameless Best Answer whore, I am now of course obliged to take this problem seriously :)

Let me see if I understand what you're trying to do with your initial shell script. As far as I can see, you want to take the output of xls2csv (which appears to have different options from the version I'm using in Ubuntu) and do the following things:

* delete the first three rows
* delete the second and third columns
* delete the first occurrence of text matching the number of remaining columns within each row (I'm baffled by this)
* delete rows consisting of exactly one character (form-feed inter-sheet markers?)

and then perform the text manipulation you described in English. Do I have that right?

If so, and you email me (address is in profile) a sample xls2csv output file to give me something real to chew on, I'll post back a script that should do what you want.

Also: is there a good reason you're using temporary files instead of pipelines to tie the processing stages together?
posted by flabdablet at 6:05 PM on June 14, 2009

« Older What kind of bird is this?   |   American working in Japan: What are tax... Newer »
This thread is closed to new comments.