automate this task, please
December 2, 2007 10:50 PM   Subscribe

How do I parse a few lines from several hundred word documents into a spreadsheet?

I need to go through about 700+ word documents in a folder structure 2 levels deep. From the documents, I need to pull out a few key details. I need to pull out a unique identifier contained in the cover page of the document for one column, the name of the document (which is always following the colon).

This line is always on page 2 of the word document:
UID ###: DOCNAMEGOESHERE

Where ### is where the 2-3 digit number is, DOCNAME is where the name of the document is.


The header name should be easily/uniquely found throughout the document as it is located in the document - on its own line as:
Header Name: HEADERNAME

Where HEADERNAME is where the name of the header is.

The columns in the spreadsheet are:
Header Name | UID # | UID Name | Folder Name


If someone could prod me in the right direction, that would make me happy. Let me know if this isn't feasible. I just shudder at doing this manually.
posted by gerg to Computers & Internet (7 answers total) 3 users marked this as a favorite
 
Might want to start with a batch conversion to text. At least that's what I'd do, but then I'd be using Unix and command line tools to parse out the info I needed and convert the format (grep, cat, some Perl or Ruby).

Hopefully someone can take my words here and provide a non-Unix solution.
posted by Kickstart70 at 10:56 PM on December 2, 2007


Its totally possible in Visual Basic for Applications. You need to use a folder as a collection and run through all the files, then search for and extract the correct info from word.

Unfortunately I don't know much about using VB with word, so at this stage, all i can do is confirm that it's completely feasible.
posted by galactain at 3:23 AM on December 3, 2007


You might want to try antiword to convert the documents to text. It's free. You can then use some text processing tools to create a comma separated list to import into excel. If you're on windows, use cygwin to get access to antiword and other unix text processing tools.
posted by DarkForest at 4:54 AM on December 3, 2007


Autohotkey can launch every document, manuever to the correct location select a word copy and paste into the document that you want it in..... now the task of writing the script may take longer than performing the task manually, but if you ever have to do it again, you'll have a new skill and new program to help out.
posted by TuxHeDoh at 8:07 AM on December 3, 2007


Being a Perl geek I'd use it to grab all of the document file names from the directory structure(module File::Find). Object returned back from File::Find would be processed similar to the MS Word document extraction process detailed here. Once you get the whole document extracted in to a variable you could then convert the variable in to an array(split by line seperators) and then the information you're looking for, as long as it's absolutley standardized should be in the identical array element($doc[1]) in each document.

A bit heady yes, but if you've got a Perl guru you could give a beer or 2 to or have the confidence yourself to do, this should be efficient.
posted by mnology at 9:21 AM on December 3, 2007


Best answer: Look, this is really ugly, but I'm trapped behind a firewall at the moment, so I can't put it up anywhere else.

This can certainly be done in VBA. The below is a quick attempt going from your description. It works on my really simple test case, but is hardly bulletproof, and is not recursive. MefiMail me if you want a hand tweaking.
Sub askme()

Directory = "C:\Documents and Settings\tog\Desktop\"     '  <>

Open "C:\results.txt" For Output As 1                    '  filename for results output

FileName = Dir(Directory & "*.doc")                      '  find doc files

Do While FileName <> ""

    Documents.Open FileName:=Directory & FileName        ' Open doc file

    With Selection.Find
        .Text = "UID"
        .Forward = True
        .MatchCase = True
    End With
    Selection.Find.Execute                                'find UID line

                                                                     ' Now grab some data
    Selection.MoveRight Unit:=wdWord, Count:=1
    Selection.MoveRight Unit:=wdWord, Count:=1, Extend:=wdExtend
    RecordNumber = Selection.Text
    Selection.MoveRight Unit:=wdWord, Count:=1
    Selection.MoveRight Unit:=wdWord, Count:=1, Extend:=wdExtend
    Recordname = Selection.Text

                                                            ' output result in csv format
	Print #1, FileName & "," & RecordNumber & " , " & Recordname & "," & ActiveDocument.Path

	FileName = Dir
	ActiveDocument.Close

Loop

Close 1

End Sub
posted by pompomtom at 4:53 PM on December 3, 2007


Response by poster: thanks, i'll give this a stab when i'm back on the machine with these documents.
posted by gerg at 10:32 PM on December 3, 2007


« Older What's wrong with my car's clutch?   |   Seeking some easily available jazz recommendations... Newer »
This thread is closed to new comments.