Taking data from a PDF and putting it into a spreadsheet
August 11, 2014 10:45 AM   Subscribe

I have a PDF with several hundred pages. Each page is formatted identically and contains data on a single person, divided into various tables. I want to parse the PDF so that each person's data is a single row in a spreadsheet, with various key points (e.g. "name") in the same column. I feel like this has to be an already-solved problem. How can I do this?
posted by LSK to Computers & Internet (7 answers total) 1 user marked this as a favorite
It kind of depends on the PDF. You need to get raw text out.

Can you open the PDF in Adobe InDesign/find someone to do that for you? I might be able to take a look later if you want to memail me. Assuming you need to keep in within your company, you should try to find someone to open it in Indesign and see if you can copy the text out of the document.

Option B (can't get raw text out) is to highlight the text and paste it into microsoft word. Then, you could try to eliminate fake spaces (this usually happens due to kerning) by find and replace " " with " " and " " with " ". (2 to 1 and 3 to 1 spaces).

There is a way to import data from a text to a spreadsheet. The google search you are looking for is "text to columns excel". Depending on the format of the tables and how the text comes out of the document, you might need some fancy macros/if-then statements to do the rest of what you want.

Unfortunately, all of these steps will be difficult and probably take an hour setup and then 5 minutes a page, roughly. Unless your excel mumbo jumbo is really strong, then you might get it to two-four hours setup and no minutes a page.

If you are at work, look for a guy that has spreadsheets hanging on his cubicle wall and has multiple monitors. He won't want to help, but if you check his calendar first, and he has 3 hours, and you offer to buy him a free lunch or something, it might make it more acceptable to ask.

Feel free to provide follow-up-details and I'll do my best to help you from here! :)
posted by bbqturtle at 10:58 AM on August 11, 2014 [1 favorite]

I was able to extract the raw text and I have it in a spreadsheet. I'm not sure how to take that and format it nicely, though. For instance, using names as the example - I have all the names, and they all repeat mostly periodically (some have an extra row between them due to an extra row in the original data), but I'm not sure how to take that and put it into a nice organized sheet.
posted by LSK at 11:06 AM on August 11, 2014

Is the data in one long row, in one cell, broken into columns, or still in the tabled format? And if it's still in the tabled format, what indicates the start of a new set of data?
posted by bbqturtle at 11:08 AM on August 11, 2014

It's broken into columns but not nicely (e.g. some things I want to extract separately are in the same column.) There's no indication whatsoever as to a new set of data. I set up a column that indicates whether a given row has a person's name in it, but I don't know how to e.g. get the 15th name, given that.
posted by LSK at 11:09 AM on August 11, 2014

Hmm. It's really hard to help with the information you've given me.

It's definitely going to be easier to break this apart when you first put it in excel than after you put it in excel.

In excel: There are definitely ways to solve your problem, but it's going to be an in-person ordeal. A long combination of it statements involving the search( feature, left( feature, sumifs( some rearranging, some pasting of values.

outside of excel/on import. My guess is that the first step would be to copy the text into notepad/word, replace spaces or something with commas, (perhaps manually? maybe you could outsource this step/give to an intern?) then resave it as a CSV and import the CSV to excel. You can reverse engineer this without googling by saving a desired output in excel as a CSV and opening it in notepad. Again, you'll need to look up paste special, text to columns, etc in google to help prepare for this.
posted by bbqturtle at 11:20 AM on August 11, 2014

Would Google Refine or OpenRefine be useful for this?
posted by kristi at 10:13 AM on August 13, 2014

I've heard Tabula is good for this. Here are some other suggestions (and their Tabula link is wrong).
posted by unknowncommand at 10:31 AM on March 22, 2015

« Older What are some activities/dates I can do to boost...   |   Accommodating the non-gluten crowd. Newer »
This thread is closed to new comments.