Manipulating Text
June 21, 2004 7:57 PM

What text-manipulation language to teach myself some of, and what book to do it with?

So I've got legislator voting data for different committees, but it's inconveniently formatted. It looks like this, or, simpler, like this:

BLAH blah blah
AYES
****
Alpha Echo
Charlie

NOES
****
Bravo

ABSENT, ABSTAINING, OR NOT VOTING
******************************
Delta

What I want to do is read in a whole bunch of these files (or the whole bunch catted together if that's easier) and output a matrix of votes:

Alpha***61119
Bravo***16191
Charlie*69169
Delta***11661
Echo****11116

And so on.

So my questions are:

(1) What language should I use to do this, knowing that apart from little bits of coding for SAS or R, I haven't really programmed anything since BASIC in 1986? My sense from googling is that the prime candidates are Perl or Python, and that this is not going to be a difficult task to program.

(2) What easily-obtainable book is good for teaching oneself the basics of the language? Just enough for me to figure out how to do this, not do it efficiently or elegantly. I don't mind if the machine chews on something for a minute instead of a millisecond, as the realistic alternative is entering them in by hand, and like hell am I doing that again if they're already in html.
posted by ROU_Xenophobe to Computers & Internet (12 answers total)
I can't discuss the relative merits of Perl v. Python, but if you go with Perl, get the Llama book from O'Reilly.
posted by jacquilynne at 8:01 PM on June 21, 2004


Perl was made to do just this, and if this is all you are going to do, you can probably hack something out pretty quickly. Python is a better language in almost every sense. It is more restrictive than Perl, so there aren't a bajillion different ways of doing the same thing, but it is ultimately more elegant and maintainable.

Perl is extremely flexible, and whatever you still remember from programming can probably be adapted and implemented. This is its greatest strength and weakness.

If you are only doing this, Perl might be the way to go, but if this is the first in a series of projects, or you are trying to learn a language, Python is probably a better choice.

As for books, I can't help you there, as I am pretty much self-taught, but check out the main pages at perl.com and python.org. You might also check out experts-exchange or a similar site, as there is almost definitely a code snippet or script out there that does something close to this.
posted by lkc at 8:08 PM on June 21, 2004


What they said.
posted by five fresh fish at 8:13 PM on June 21, 2004


If you go for Python, which I think is a good idea, the book Text Processing in Python is available free online, and the appendix contains a script that converts the structured ASCII text of the free files into pretty, colorful HTML.

I'm not sure how much it will help you learn to program in Python, but that's pretty easy to do on your own.
posted by kenko at 8:14 PM on June 21, 2004


Another recommendation for Python. I use it every work day for parsing and manipulating text files, and for "gluing" applications together. Can't be beat.
posted by SPrintF at 8:25 PM on June 21, 2004


Ta much!

Is there any great (more than 50 hour) time advantage in starting with one of them?

That is, if learning(some)+using python is likely to take me another day or week to get it done, that's one thing, but if it means another month, that's a different kettle of gopher guts.
posted by ROU_Xenophobe at 8:33 PM on June 21, 2004


Python is pretty cool, but if it were me I'd do it in Emacs. If you're already an Emacs user, then I'm guessing you could do what you want in a keyboard macro without diving down into Emacs Lisp. The reason I'm guessing is that I don't understand what the second row of the output matrix is supposed to mean.
posted by rdr at 8:58 PM on June 21, 2004


I learned Python to the point of creating useful short scripts in the space of one day. Without a computer. I read "Learning Python" during a road trip. Grokked 90% of it immediately, finding it a very natural fit to my thinking.

Perl, on the other hand, looks like such a dog's breakfast that I have never had the slightest inclination to take up the challenge.
posted by five fresh fish at 9:00 PM on June 21, 2004


If it's just a one-off, then I'd do it in a spreadsheet, as it's not a general programming exercise; you're just trying to total fields and that kind of thing. You'll probably need to learn some macros (I'm not sure I understand the structure), but you'll be dealing with an easy data structure of cells, and depending on your office suite you can write Visual Basic script or OpenOffice.org script.
posted by holloway at 9:08 PM on June 21, 2004


I don't understand what the second row of the output matrix is supposed to mean

A line for each legislator followed by their sequence of votes, as culled together from the 15 to 200 or so votes in each committee. Each file should turn into one column of numbers, sorted by legislator.

And then again later for the 80 legislators times 2000+ votes on the floor.
posted by ROU_Xenophobe at 9:08 PM on June 21, 2004


I should have been clearer about that. I'm not just totaling fields. The first lines of the output matrix mean that Alpha voted no, yes, yes, yes, absent; and that Bravo voted yes, no, yes, absent, yes.

So I need to go through these files (or through the requisite files catted into one) and say here's a vote, here's how everyone voted*, here's the next vote, here's how everyone voted, etc, until I'm out of files (or at the end of the catted file).

*which is where the files are inconvenient, because they're a nonalphabetical list of who voted yes, and then a nonalphabetical list of who voted no
posted by ROU_Xenophobe at 9:15 PM on June 21, 2004
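
For what it's worth, the pass described above (here's a vote, here's how everyone voted, next vote) can be roughed out in Python along these lines. This is only a sketch under some loud assumptions, not a finished tool: it assumes each file holds exactly one vote laid out like the simplified sample in the question, that names are single words separated by whitespace, and that a blank line ends each section. The codes 1, 6, and 9 for yes, no, and absent come from the example matrix; the "0" used for a legislator who does not appear in a file at all is made up for illustration, as are the function names parse_vote and build_matrix.

```python
import sys

# Section headers from the simplified sample, mapped to the codes used in the
# example matrix: 1 = aye, 6 = no, 9 = absent, abstaining, or not voting.
SECTION_CODES = {
    "AYES": "1",
    "NOES": "6",
    "ABSENT, ABSTAINING, OR NOT VOTING": "9",
}


def parse_vote(text):
    """Return {legislator: code} for the text of one vote file."""
    votes = {}
    code = None  # no section header seen yet, so preamble lines are ignored
    for line in text.splitlines():
        line = line.strip()
        if line in SECTION_CODES:
            code = SECTION_CODES[line]
        elif not line:
            code = None  # assumption: a blank line ends the current section
        elif set(line) == {"*"}:
            continue  # the row of asterisks under a header
        elif code is not None:
            # Assumption: names are single words separated by whitespace.
            for name in line.split():
                votes[name] = code
    return votes


def build_matrix(vote_texts, missing="0"):
    """Return one string per legislator: padded name, then one code per vote.

    `missing` is a hypothetical placeholder for a legislator who is not
    listed in a given file; the original thread does not define one.
    """
    parsed = [parse_vote(text) for text in vote_texts]
    names = sorted({name for votes in parsed for name in votes})
    # Pad with asterisks to one past the longest name, as in the example.
    width = max(len(name) for name in names) + 1 if names else 0
    lines = []
    for name in names:
        codes = "".join(votes.get(name, missing) for votes in parsed)
        lines.append(name.ljust(width, "*") + codes)
    return lines


if __name__ == "__main__":
    # Usage: python votes.py vote1.txt vote2.txt ...  (file names hypothetical)
    texts = []
    for path in sys.argv[1:]:
        with open(path) as handle:
            texts.append(handle.read())
    for row in build_matrix(texts):
        print(row)
```

The columns come out in the order the files are given on the command line, so the files need to be listed in vote order. Real legislator names with spaces in them, or the full format rather than the simplified one, would need a smarter way of splitting names than line.split().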


You might be able to do it with awk. Me, I'd use Perl.
posted by gimonca at 6:03 AM on June 22, 2004

