Column-oriented CSV in Python?
January 11, 2010 9:08 AM   Subscribe

Is there a simple, elegant way to iterate over columns in CSV files using Python?

I'm trying to process some CSV files using Python and its built-in csv module. However, the files are all column-oriented, and csv seems purely row-oriented. Ideally, I'd like to iterate over columns in the same way that it lets you iterate through rows.

The input file may consist of an arbitrary number of columns (probably less than 50, but there is no hard limit) and a much larger number of rows (possibly thousands), but there are varying numbers of rows in each column. I'd like to have the column as a list that I can then play with.

I feel like there has to be a simple way to do this, but I can't figure it out and I'm working on something of a deadline. Is there an easy way that I'm missing? It seems like this has to be a common issue, since non-programmer spreadsheet users tend to create column-oriented files all the time.

As a workaround, I've been opening the CSV in Excel and transposing the data, making the lists into rows instead of columns, but this is time-consuming and just rubs me the wrong way anyway.

Just to clarify, when I say "column-oriented", this is what I mean:
There is no logical connection between "Alabama", "Red" and "pine". The logical grouping is a list of states, then a totally separate list of colors, and then a totally separate list of trees. I want to work on each list separately.
posted by Kadin2048 to Computers & Internet (13 answers total) 3 users marked this as a favorite
I haven’t worked with Python’s CSV module, but can’t you read the whole file into memory and do the transposition in Python?

states = [], colors = [], trees = []
for row in csvreader:

posted by ijoshua at 9:14 AM on January 11, 2010

One way you could do it that has the benefit of simplicity (if not elegance, if elegance means it does it for you) would be to open the csv file as just a file and iterate over the rows and build the column lists yourself. I'm not sure if each of your csv files has the same or different headers or how adjustable you want the code, but it would be relatively simple using the readlines() method on a file descriptor to get all the lines in the file. Then, use the string split(",") method to split the string into a list of strings (it removes the comma) and sort each one into either a hardcoded set of lists (if your csv files all contain the same information), or you could use the header line to instantiate a dictionary of lists (i.e. data = {'states':[], 'colors':[], 'trees'=[]} that you would then fill.

Or, on preview seeing ijoushua's comment, csvreader does seem able to return you all the data separated nicely for resorting without having to parse it yourself.
posted by bookdragoness at 9:20 AM on January 11, 2010

Best answer: There is a simpler way:

import csv

f = csv.reader(open('file.csv'))
states, colors, trees = zip(*f)

posted by jedicus at 9:21 AM on January 11, 2010 [2 favorites]

Uh, read it in with a DictReader, enumerate the dictionary keys, create a set per column and as you read each line copy each element into the appropriate set? Then you'll end up with one set per column which you can do whatever with. Uses memory, but meh.
posted by GuyZero at 9:22 AM on January 11, 2010

jedicus wins.
posted by GuyZero at 9:22 AM on January 11, 2010

Response by poster: Wow, that was quick. Thanks, everyone so far.

Jedicus: is there a way to do that for an unknown number of columns? I'd like to be able to take as input a file that might have 1 column, might have 3, might have 25... etc. Is there a way to loop over that, or maybe create a list of lists that I can then iterate through?

(In reality the input files I'm getting are groups of users; each column is a group, where row[0] is the group name, and then the remaining rows are usernames. A single file might have many groups, each with many users.)

I've haven't run into zip() before, so I'm trying to bring up the docs on it now... is a bit slow today.
posted by Kadin2048 at 9:39 AM on January 11, 2010

This isn't quite answering the question, but this kind of task would also be very easy in the language R, which is free, easy to install, and has great built-in stats and plotting capabilities. If you read in a CSV it'll be a data frame which has the columns as natural iterators. I'm a big fan of both Python and R - but for easy data manipulation R would probably be my choice.

# read in a CSV file with headers
my_table = read.csv("myfile.csv", header=T)

#access the column with header "states"

# access the first column

# iterate over columns and print 3rd element of each column
# this would give output Arizona, Green, poplar
for(column in my_table) {

posted by 7-7 at 9:48 AM on January 11, 2010

jedicus: is there a way to do that for an unknown number of columns? I'd like to be able to take as input a file that might have 1 column, might have 3, might have 25... etc. Is there a way to loop over that, or maybe create a list of lists that I can then iterate through?

zip() returns a list of tuples. In this case, each tuple would contain a column of data. You should be able to iterate over the list and handle each tuple individually. If your data has headers, then the first element in each tuple should be the column header.
posted by jedicus at 10:00 AM on January 11, 2010

Best answer: For an unknown number of columns, replace
states, colors, trees = zip(*f)

columns = zip(*f)

and determine the number of columns with

posted by ijoshua at 10:00 AM on January 11, 2010

i don't know if python has it, but in ruby, you can use the 'transpose' method of an array, do your manipulation, then 'transpose' it back.
posted by ArgentCorvid at 10:03 AM on January 11, 2010

Best answer: zip essentially is a transpose function.

Careful, though! If your columns have different lengths, zip will truncate them all to the length of the shortest column. itertools.izip_longest is like zip, but the longest column is kept, and shorter ones are padded with None.

Maybe try something like this:
list(list(itertools.takewhile(lambda x: x is not None, column))
     for column in itertools.izip_longest(*table))

posted by domnit at 11:43 AM on January 11, 2010 [1 favorite]

domnit raises a good point. I had assumed that, like most csv files, your table is rectangular. If it isn't, or you want to code for a more general case, you'll need a more complex solution.
posted by jedicus at 12:06 PM on January 11, 2010

Response by poster: Interesting on the truncation. I couldn't figure out why my data was working, since I have columns of varying heights. But then I looked at it and remembered that Excel (which is where these CSVs are coming from, originally) seems to pad out rows to bring them all up to the same number of columns, even if there is no more data in the remaining columns in that row.

E.g., running
columns = zip(*csvreader)
for column in columns:
        print column
On this (which is from Excel's CSV exporter):
works just fine. But if you knock off (in a text editor) the trailing commas after "six", it's still valid CSV in Excel (and is displayed the same), but columns 3 and 4 get lost from the output.

The itertools version handles this situation perfectly. (You get a mix of empties where Excel padded the rows, and then nulls where izip_longest does.)

posted by Kadin2048 at 1:15 PM on January 11, 2010

« Older Can I get a discount on postage for mailing 300...   |   Help me archive a roomful of old Mac data Newer »
This thread is closed to new comments.