automate this task, please
December 2, 2007 10:50 PM Subscribe
How do I parse a few lines from several hundred word documents into a spreadsheet?
I need to go through about 700+ word documents in a folder structure 2 levels deep. From the documents, I need to pull out a few key details. I need to pull out a unique identifier contained in the cover page of the document for one column, the name of the document (which is always following the colon).
This line is always on page 2 of the word document:
UID ###: DOCNAMEGOESHERE
Where ### is where the 2-3 digit number is, DOCNAME is where the name of the document is.
The header name should be easily/uniquely found throughout the document as it is located in the document - on its own line as:
Header Name: HEADERNAME
Where HEADERNAME is where the name of the header is.
The columns in the spreadsheet are:
Header Name | UID # | UID Name | Folder Name
If someone could prod me in the right direction, that would make me happy. Let me know if this isn't feasible. I just shudder at doing this manually.
I need to go through about 700+ word documents in a folder structure 2 levels deep. From the documents, I need to pull out a few key details. I need to pull out a unique identifier contained in the cover page of the document for one column, the name of the document (which is always following the colon).
This line is always on page 2 of the word document:
UID ###: DOCNAMEGOESHERE
Where ### is where the 2-3 digit number is, DOCNAME is where the name of the document is.
The header name should be easily/uniquely found throughout the document as it is located in the document - on its own line as:
Header Name: HEADERNAME
Where HEADERNAME is where the name of the header is.
The columns in the spreadsheet are:
Header Name | UID # | UID Name | Folder Name
If someone could prod me in the right direction, that would make me happy. Let me know if this isn't feasible. I just shudder at doing this manually.
Its totally possible in Visual Basic for Applications. You need to use a folder as a collection and run through all the files, then search for and extract the correct info from word.
Unfortunately I don't know much about using VB with word, so at this stage, all i can do is confirm that it's completely feasible.
posted by galactain at 3:23 AM on December 3, 2007
Unfortunately I don't know much about using VB with word, so at this stage, all i can do is confirm that it's completely feasible.
posted by galactain at 3:23 AM on December 3, 2007
You might want to try antiword to convert the documents to text. It's free. You can then use some text processing tools to create a comma separated list to import into excel. If you're on windows, use cygwin to get access to antiword and other unix text processing tools.
posted by DarkForest at 4:54 AM on December 3, 2007
posted by DarkForest at 4:54 AM on December 3, 2007
Autohotkey can launch every document, manuever to the correct location select a word copy and paste into the document that you want it in..... now the task of writing the script may take longer than performing the task manually, but if you ever have to do it again, you'll have a new skill and new program to help out.
posted by TuxHeDoh at 8:07 AM on December 3, 2007
posted by TuxHeDoh at 8:07 AM on December 3, 2007
Being a Perl geek I'd use it to grab all of the document file names from the directory structure(module File::Find). Object returned back from File::Find would be processed similar to the MS Word document extraction process detailed here. Once you get the whole document extracted in to a variable you could then convert the variable in to an array(split by line seperators) and then the information you're looking for, as long as it's absolutley standardized should be in the identical array element($doc[1]) in each document.
A bit heady yes, but if you've got a Perl guru you could give a beer or 2 to or have the confidence yourself to do, this should be efficient.
posted by mnology at 9:21 AM on December 3, 2007
A bit heady yes, but if you've got a Perl guru you could give a beer or 2 to or have the confidence yourself to do, this should be efficient.
posted by mnology at 9:21 AM on December 3, 2007
Best answer: Look, this is really ugly, but I'm trapped behind a firewall at the moment, so I can't put it up anywhere else.
This can certainly be done in VBA. The below is a quick attempt going from your description. It works on my really simple test case, but is hardly bulletproof, and is not recursive. MefiMail me if you want a hand tweaking.
This can certainly be done in VBA. The below is a quick attempt going from your description. It works on my really simple test case, but is hardly bulletproof, and is not recursive. MefiMail me if you want a hand tweaking.
posted by pompomtom at 4:53 PM on December 3, 2007Sub askme() Directory = "C:\Documents and Settings\tog\Desktop\" ' <> Open "C:\results.txt" For Output As 1 ' filename for results output FileName = Dir(Directory & "*.doc") ' find doc files Do While FileName <> "" Documents.Open FileName:=Directory & FileName ' Open doc file With Selection.Find .Text = "UID" .Forward = True .MatchCase = True End With Selection.Find.Execute 'find UID line ' Now grab some data Selection.MoveRight Unit:=wdWord, Count:=1 Selection.MoveRight Unit:=wdWord, Count:=1, Extend:=wdExtend RecordNumber = Selection.Text Selection.MoveRight Unit:=wdWord, Count:=1 Selection.MoveRight Unit:=wdWord, Count:=1, Extend:=wdExtend Recordname = Selection.Text ' output result in csv format Print #1, FileName & "," & RecordNumber & " , " & Recordname & "," & ActiveDocument.Path FileName = Dir ActiveDocument.Close Loop Close 1 End Sub >>
Response by poster: thanks, i'll give this a stab when i'm back on the machine with these documents.
posted by gerg at 10:32 PM on December 3, 2007
posted by gerg at 10:32 PM on December 3, 2007
« Older What's wrong with my car's clutch? | Seeking some easily available jazz recommendations... Newer »
This thread is closed to new comments.
Hopefully someone can take my words here and provide a non-Unix solution.
posted by Kickstart70 at 10:56 PM on December 2, 2007