How to run simple statistical analyses on tab-delimited data in Word docs?
April 17, 2006 4:00 PM   Subscribe

I want to take a bunch of numerical data (say, survey results) found in Word documents in roughly tabular form (though not in tables per se) and run simple statistical analyses on them. What's the best way to do this?

I'm looking at a ton of word files, each a set of survey results all its own. If I copy any particular Word file into Excel, Excel is smart enough to place each number in its own cell. That's how tab-delimited (though not in strict Word tables) they are.

Now I want to take these hundreds of survey documents, all with data in roughly the same form (ie comparable numbers in comparable cell positions), and run very simple analyses on them. E.g. which surveys have the highest numbers in this field?; among subgroups of surveys that I designate, which have the combined highest or lowest averages in another certain field?, etc.

Any recommendations on how to batch-process this large set of files with an absolute minimum of effort? Thanks.
posted by shivohum to Computers & Internet (10 answers total)
 
If you need to do this sort of task often, consider learning PERL (aka Practical Extraction and Report Language). PERL was designed to solve this type of problem. On the other hand, if this is a one-time situation, learning PERL is likely sub-optimal.
posted by ryanrs at 4:27 PM on April 17, 2006


Yes, this is a one-time thing. Learning PERL just for this task would be way beyond my desires :). But thanks for the suggestion anyway.
posted by shivohum at 4:31 PM on April 17, 2006


A far far easier alternative to Perl is Visual Basic for Applications, which was designed for exactly this sort of task. At the simplest level, you can just record a macro (which requires no programming at all). Then, open it up in the VBA editor that comes with Word/Excel and tweak to your heart's desire. There are functions for copying, pasting, selecting, selecting, and basically doing anything that Word/Excel can do, but in an automated fashion.
posted by matthewr at 5:28 PM on April 17, 2006


Record a macro while you go through the process of opening one of your files, finding the data, and copying it to the clipboard. Then poke and prod that macro using the Visual BASIC editor until it will open, find and copy data from all your source files in sequence. Have it paste everything into one great big document containing only data, with a header before each bunch that identifies which file it came from. Then open that in Excel.
posted by flabdablet at 5:31 PM on April 17, 2006


I used to do this sort of thing with various data base programs. As a matter of fact, I recently took about 250 files worth of impulse waveform data, collected it into a coherent whole and crunched it using a handful of programs, including Excel. Tedious, but do-able! Lots of scripts, macros, batch files, etc.

Excel is limited in query functions, and unless you are a real wiz with it, you are shoehorning the data into an application that may not be the best one.

IMO, the very first thing you need to do is homogenize your data and expand it by adding a field describing its origin( i.e., a field (column) with the Word file name in it or the name of the survey or the date of the survey or something similar.) You'll have an easier time crunching it if all the data is in one big ass file.

If you are already familiar with Word/Excel macros, then you have a head start, but if you aren't, better allow a day or two to get up to speed with them and do a lot of tests to make sure you are processing your files consistently. I'd say if you have less than 25-35 files, do it manually if it's really a one time thing. If you have 100+ files, macros will net you a time savings. (Some idiot is going to tell me that they can do it in an hour, but trust me, if you are a novice, you'll be hollering at your Help dropdowns for quite a while. A lot is left unsaid. )

Once the data is homogenized and in Excel, you can pretty easily get it into other more appropriate programs, and use the query facilities therein. Perhaps someone in your field of acquaintances knows SQL or something similar. I am not familiar with Perl, but it's probably something similar. I don't use Access, but I have used a lot of report generators and old data base programs, such as dBase.

If you elect to use Excel, it has a very interesting little feature called filtering. It can help screen out data of no interest to your immediate needs and there is an Autofilter function that you can apply to restrict the scope of your inquiries. There are other things, too. Again, most of it is shoehorning the data into a useable but suboptimal application, but it is possible and perfectly appropriate for one time use.

I'd have to make a kazillion assumptions about your analysis needs, but if they are simple things like mean, std deviation, histograms, etc., Excel has a lot of other functions, but it's tedious to use them, IMO. A data base manager can help get reports out sorted and processed in a coherent order. They really are more suited for that type of manipulation, IMO, than a spreadsheet program. I am also sure that, again, depending on your specific analysis needs, there are public domain programs for crunching your data. 'Simple statistical analysis' leaves too much to my imagination to make a good recommendation, I am sorry to say. VB is good stuff, but if you aren't a programmer, it might not be a quick study for a one-off project.

However, whatever you use, homogenizing your data is the first step. What you put in into is the second one.

Be prepared for a long sit in front of the PC, regardless.

Good luck. Sounds like you are in for a learning experience and the next time you see something like this, you'll be the expert!
posted by FauxScot at 7:45 PM on April 17, 2006


One more thing....

I'd do two or three files to see if you can get what you want out of a consolidated file first. That'll tell you pretty quickly if it's worth doing the rest of them.

Sorry to be so wordy!
posted by FauxScot at 7:58 PM on April 17, 2006


Yeah, if you don't want to learn VBA, here's a brute-force method. You could open all your word docs as excel files (they're already tab-delimited, so it should work, though you might have to change the file extensions.) You can copy-paste each document into an enormo-spreadsheet with something like 10 keystrokes per file:

Start in the upper left hand of the new file, ctrl-shift-right, then ctrl-shift-down to select the table, ctrl-c to copy it, alt-tab to your new enormo-sheet, ctrl-v to paste it, alt-tab back, ctrl-w to close this document, revealing the next one to do. Each subsequent time you come to the enormo file, you'll need probably 3 more for positioning your cursor where you want the next table to go.

Space out for maybe 45 minutes, give yourself carpal tunnel, and you're done.

(You could also "move/copy sheet" for each one of those, but I think the analysis would be easier with the enormo-sheet.)
posted by salvia at 11:29 PM on April 17, 2006


I also think Perl is a great tool for this, though I capitalise it rather differently...

If you want to email me some examples I can see how complex a job it is. If it's easy I'll do it for you.
  • If the documents are all the same,
  • if the other information on the page is regular in format, i.e you don't have some with a paragraph at the top and some with with two and some with none,
  • if the documents all have the same number of columns with the same data in the same column in each
then it could take as little as five minutes to consolidate them into one big document.

But all of those "if"s could make it into a nightmare job to automate and you might as well do it by hand.
posted by AmbroseChapel at 11:34 PM on April 17, 2006


Excellent answers as usual. Thanks. And AmbroseChapel, I might well take you up on that offer...
posted by shivohum at 9:20 AM on April 18, 2006


Just for the record, I took a look at some samples, and unfortunately, the way they were typed meant they failed dotpoint number three -- where there were seven columns of data, one of them empty, the typist should have done this:
 COL: 1       2      3      4       5       6       7DATA: a [tab] b [tab] [tab] c [tab] d [tab] e [tab] f
where there's nothing in column 3, but instead they'd done this:
 COL: 1       2      3      4       5       6       7DATA: a [tab] b [tab       ]c [tab] d [tab] e [tab] f
and the futzed with the ruler in Word to force the columns to line up. So unfortunately there's no way to figure out which of the seven columns those six bits of data belonged to.
posted by AmbroseChapel at 8:08 PM on April 26, 2006


« Older Free Bell ringtones anyone?   |   Swamp gas? Newer »
This thread is closed to new comments.