Column limits suck
February 5, 2009 1:02 PM   RSS feed for this thread Subscribe

Editor for very large flat files?

I'm assembling a large flatfile dataset to feed into another program*. Assembling it is almost entirely just a matter of pasting together existing matrices in a sort of checkerboard pattern and marking the rest of the cells as missing.

When it's done, the file will be approximately 7500 rows by 42000 columns. I'll be using csv as an intermediate format. Editing in a spreadsheet (or something identical to me as a user) would be the most convenient.

But Excel 2007 tops out at 16K columns. Is there anything out there that can open and edit a file like that? Ideally cheap or free as always.

I do know some workarounds:

(1) Transpose the matrix so it will fit in Excel 07, and re-transpose the final product in R or something else.

(2) Cobble it together out of multiple files/sheets and then paste it all back together in a text editor or with perl.

But in both of those cases, I'm left with a "final" csv that I can't open in anything to check that everything went okay.

*Estimating legislators' ideal points in a common ideological space.
posted by ROU_Xenophobe to computers & internet (20 comments total) 2 users marked this as a favorite
Textpad or Notepad++ are both pretty decent text editors that load large files pretty well.
posted by wrok at 1:28 PM on February 5


Vim.
posted by GuyZero at 1:34 PM on February 5


Umm... could you turn the schema on its side, so you have 7500 columns and 42k rows?
posted by Slap*Happy at 1:38 PM on February 5


Second Textpad and Vim. Of course Emacs should be able to handle it as well. I think you should take a look at jEdit, as it should be able to A) handle the size and B) has some helpful plugins for working with CSV files.

For systemic manipulation consider installing cygwin and learning awk or sed.
posted by teabag at 1:53 PM on February 5


A 64-bit version of emacs should work okay. This seems like a task ideally managed with Perl or Python. Is there a reason that an editor and manual compilation would be necessary?

To be able to space text, you could use a subroutine to do row and column counting:

$columns = 1; $rows = 0;
while ($line = <>) {
  chomp $line;
  @chars = split(//, $line);
  for ($i=0; $i<>     if ( ord($chars[$i]) == 9 ) { $columns++; }
  }
  $rows++;
}


With the row and column counts you can write another subroutine to join the lines or put NA or NaN values, accordingly. Since you're working with your files one line at a time, you wouldn't need as much memory with this approach.

With Perl modules, you can export each line to CSV output fairly easily:

$ perl -MCPAN -e shell
cpan> install Snail::CSV

This will do the work of escaping special characters and wrapping fields.
posted by Blazecock Pileon at 2:00 PM on February 5


OpenOffice.org, gnumeric, and Numbers are all out as their column limits are too low. Access, MySQL, and PostgreSQL all fail as well.

Apart from using a text editor like one of the ones mentioned above, I suggest giving some thought to the lay out of your data. Whenever a data set has a large number of columns, it's very likely that it can be broken into a set of related tables that are much more manageable.

It also sounds like you're trying to abuse Excel by treating it as a database, which is generally a recipe for disaster. Since you're almost certainly importing the bulk of this data programmatically anyway, it makes a lot of sense to put it into a proper database. Your statistical analysis program (you mentioned R) can interface with the database directly rather than reading a giant flat file.

Having to design a database schema and handle importation may seem like a lot of work, but I'd wager it will be less trouble in the long run, especially if your needs change.
posted by jedicus at 2:05 PM on February 5


Let's try that again:

$columns = 1; $rows = 0;
while ($line = <>) {
  chomp $line;
  @chars = split(//, $line);
  for ($i=0; $i<=$#chars; $i++) {
    if ( ord($chars[$i]) == 9 ) { $columns++; }
  }
  $rows++;
}

posted by Blazecock Pileon at 2:05 PM on February 5


I like Notepad++ a whole lot, so I'll mention that it's great... but didn't boring old Windows Notepad work to open this file? I have a hard time believing there's a file-size limit on Notepad, but I've been wrong about the limitations of Microsoft software before, I guess.

But you say you'd like to be able to open and edit a file like that somewhere. My friend, I am very sad to say so, but there is no spreadsheet program anywhere that will do this for you. You could do your transpose thing; I've never tried this, but I imagine that it will be slow and dangerous for your data.

Spreadsheets aren't meant for that many fields. Spreadsheets have upper limits because anything above them taxes them too far for their capabilities. What you need is a database.

You're not looking to create a full-featured, lasting, easily-maintained database, I know, so I think you probably just want a simple interface for an SQLite database. (SQLite is neat because, kinda like a spreadsheet, the whole database is just in one file, rather than a complex relationship of queries reported to memory and affecting data that's stored.) I poked around a bit, and there's a thing called SQLite Expert that appears to do precisely what you want; the personal version is freeware, for starters, and the pro license is pretty fairly priced ($59). The reason I gravitate toward this one is because, unlike the SQL interfaces I've seen and worked with, it appears to allow the user who wishes to to simply edit a grid of data in place. That's what you want, right? Pretty simple - similar to Excel. And this, like any worthwhile SQLite interface, should allow you to import and export csv files.

Another neat (and slightly more creative) option on the SQLite front that occurred to me was SQLite Manager, a project that some people have been putting together on Google Codes which I've been meaning to try but haven't been able to for various reasons. It's not a standalone program - it's an extension for Firefox. It lets you create and edit SQLite files in a pretty easy and intuitive way, from the screenshots, and if you already have Firefox, it might be worth the three seconds it takes to check it out. It does import/export of csv files, too.
posted by koeselitz at 2:26 PM on February 5 [1 favorite]


If you have a box around you could run GNOME on, you could try building Gnumeric with SHEET_MAX_COLS and SHEET_MAX_ROWS increased in src/gnumeric.h.
posted by nicwolff at 2:28 PM on February 5


I have yet to find flat file problems which I cannot fix with Boxer. Yea Boxer!
posted by shothotbot at 2:31 PM on February 5


SAS would be able to handle that many columns, but it's not free. I know many universities make it available to their faculty and students through some kind of pool licensing or through specialized statistics servers, so is this a possibility?
posted by needled at 2:41 PM on February 5


I work in bioinformatics, so I handle really large datasets a lot. My feeling is that coding is the correct way to do this. It's important to remember that just because you can't open the final product doesn't mean that you can't check its validity. For the sake of simplicity, lets say your data is huge matrix of numbers.

To test that the combination worked correctly, you can grep through all the small input files and get the values for one particular row. What's their sum? now check the big matrix. Is the sum of that row the same as the sum of the input data? If so, then you can be more confident that it's working.

Since you're doing this with code, it's just as easy to check the values of every row that way to make sure they match up. Frankly, 7k by 42k isn't that big - it'll probably fit in memory easily and so any kind of scripting approach should work fine and finish relatively quickly.

That's a simple example, I'm sure you can think of some other simple tests for validation. (got text? cat it all together into a string and compare md5 digests, etc)
posted by chrisamiller at 2:51 PM on February 5



Vim.


Yes. When you have a question that goes "I need an editor to do X", the answer is always Vim.

It is never Emacs. Unless you have 20 fingers and 20 toes.
posted by DecemberBoy at 2:51 PM on February 5 [1 favorite]


Thanks. I kind of figured there was nothing that would make this not a pain in the ass.

For the record:

The software (w-nominate or dw-nominate) I'm using can't interface with a database. It can't interface with anything except a correctly formatted flat ascii file (for dw-nominate, a set of flat ascii files). Nobody on Earth would ever open this file again.

All I wanted to do was be able to open the next-to-final csv file in LessLimitedSpreadsheet and be able to easily visually verify that the vote data for, say, the NY House hadn't been accidentally shifted a row or column in either direction, which would be Very Bad. I can still check, along the lines that chrisamiller noted, it's just going to be a rather greater pain in the ass than it would be if I could just open a spreadsheet and freeze panes.
posted by ROU_Xenophobe at 3:07 PM on February 5


All I wanted to do was be able to open the next-to-final csv file in LessLimitedSpreadsheet and be able to easily visually verify that the vote data for, say, the NY House hadn't been accidentally shifted a row or column in either direction, which would be Very Bad

If you do this programmatically, you can put a non-numerical character in between matrices. After you create your big matrix, you then write a subroutine that looks for numerical values where there shouldn't be any.

Or you can simply count rows and columns of individual files, and then your big matrix should logically contain a sum of those values from individual files.
posted by Blazecock Pileon at 3:26 PM on February 5


It strikes me that maybe we should just be looking for a csv editor-type-thing. This CSVed thing looks pretty good ("CSVed can handle big files and is very fast.") Apparently designed for this type of thing.
posted by koeselitz at 3:38 PM on February 5


aaand koeselitz hits the right search term! looks like there are several.
posted by ROU_Xenophobe at 5:06 PM on February 5


I deal with large text files (~1GB) and Large Text File Viewer has been a godsend. All other text editors (including Word 2007) either refused to open really large files or choked to death trying. It has buttons to move forward or back one million lines—that should tell you something.
posted by iconjack at 3:44 AM on February 6 [1 favorite]


The old but still great Programmer's File Editor? Last updated exactly ten years and five days ago, but it runs great on XPPro:
http://www.lancs.ac.uk/staff/steveb/cpaap/pfe/pfefiles.htm
posted by wenestvedt at 6:39 AM on February 6


Thanks; I was looking for a cell-based, row-column style editor, not a text editor. I expect that I'll be able to open the "final" csv in winedt, it's just that I won't be able to glance at it and see that a row or column has been misplaced because the columns won't line up.
posted by ROU_Xenophobe at 6:48 AM on February 6


« Older Commercial renters rights in N...   |   I'm building a component to a ... Newer »

You are not logged in, either login or create an account to post comments