Help me streamline a data gathering workflow?
February 15, 2016 2:46 PM

What started as a screen scraping project has evolved (devolved?) into working with lots of word files with embedded tables. I found and customized an excel macro to extract them on a file by file basis. I'll describe what I'm doing with it inside. Can you give me some pointers on further automating what I'm doing -- or point me to a forum that can?

A particular web-based, database service I work with has a lot of data that I'm interested in. They don't allow scraping; they aren't interested in configuring things so that their front-end will put things together usefully for my use-case; nor are they willing to put it together from their back-end for a price I can afford. They do, however, have a tool that will generate summary Word Docs (docx) that have embedded tables that I can work with.

Thus far, I've been extracting information from each docx into an excel workbook. I've saved a template version of a workbook to process each docx. Generally, here's how it goes:

- Open template workbook and 'save as...' to mirror object docx.
- Run a macro (found online and customized) in workbook to pull all table data into worksheet1.
- Copy data from worksheet1 into worksheet2 where I've saved several formulas that help me identify the rows I'm interested in and concatenate information from different tables in the docx.
- In worksheet2: add a field that identifies the source docx; find and replace an extraneous string in several cells; filter a column that reveals only rows of interest.
- Copy rows of interest from worksheet2 and paste values in worksheet3 that uses formulas to parse text in each row.
- Copy worksheet3 and paste values into worksheet4 yielding a (mostly) tidy table of the data I'm interested in.
- Copy all of the worksheet4 tables from all of the excel files into a master workbook for analysis (pivot tables, filtered tables, and the like).

By way of background, I've built some simple form-based Access applications (see previous AskMe here) and have cobbled together some Python-based scraping scripts. I've gotten where I am through google-hunting-and-pecking on Word and Excel with signs pointing to beefing up my VBA skills. I could really ramp up this project if I find an efficient way to process all these Word docs strewn into a growing collection of folders and subfolders.

Do you have any specific advice about streamlining based my general description? Or can you point me to the best way to parse this question into specific, technical details that I can look up (or ask again, possibly on a more-targeted forum)?
posted by GPF to Computers & Internet (3 answers total) 3 users marked this as a favorite
Pull that data into a worksheet and then put it into Access. Access tables are not much different than a single worksheet in Excel. Pasting that data into another worksheet with formulas would be accomplished by writing a query in Access. All of those 'Copy Data into another sheet' steps could probably be done in a single Access macro.
posted by soelo at 3:05 PM on February 15, 2016


I'd recommend taking a high level view of the problem, decide what steps will be necessary, then focus on one step at a time. For example, since you'll be performing the same operation on each docx file, you'll need to find out how to list the *.docx files in a folder and process each one in a loop. Figure out how to write this loop (perhaps just displaying each file name in the debug window using Debug.Print), then move on to the next problem.

I'm not going to go too in depth on this, but since I've done a lot of VBA work I know a lot of what you have to do off the top of my head, I can give you some pointers and links to the official Microsoft documentation of the functions you will need to use.

[- Loop through all the *.docx files in a folder]
Use the Dir function and a For loop. You'll want to write this macro in a separate Excel file that will be open the whole time while it is processing the docx files and creating new Excel files based on your template.

- Open template workbook and 'save as...' to mirror object docx.
Use Workbooks.Open and Workbook.SaveAs.

- Run a macro (found online and customized) in workbook to pull all table data into worksheet1.
Application.Run. Note that part that says "the [macro name] will be evaluated in the context of the active sheet": this means that if you want to run a macro in a particular workbook, you need to make sure that a sheet in that workbook is activated before you call Run.

- Copy data from worksheet1 into worksheet2 where I've saved several formulas that help me identify the rows I'm interested in and concatenate information from different tables in the docx.
- In worksheet2: add a field that identifies the source docx; find and replace an extraneous string in several cells; filter a column that reveals only rows of interest.
- Copy rows of interest from worksheet2 and paste values in worksheet3 that uses formulas to parse text in each row.
- Copy worksheet3 and paste values into worksheet4 yielding a (mostly) tidy table of the data I'm interested in.
- Copy all of the worksheet4 tables from all of the excel files into a master workbook for analysis (pivot tables, filtered tables, and the like).

This is all pretty straightforward manipulation of Range objects in Excel. Don't get too caught up on the details here because you'll just overwhelm yourself. Focus on each problem a bit at a time.

You'll receive better help asking on stackoverflow.com, but the moderators there will get upset if it seems like you're asking for someone to write a lot of code for you. Make sure you keep the questions specific to a single issue, indicate the steps you've taken to research the problem yourself, and be clear on where you're having difficulty.
posted by zixyer at 3:26 PM on February 15, 2016


Maybe you could learn from a similar project.
posted by w0mbat at 9:38 PM on February 16, 2016


« Older Good Excel/CSV data visualisation tools for...   |   Warm Showers hospitality for touring cyclists Newer »
This thread is closed to new comments.