Need some advice on how to format an Excel sheet generated from a pdf
February 15, 2020 7:18 PM   Subscribe

I have an Excel spreadsheet from a 159 pg pdf. It contains names, addresses, state, city, zip, phone number, age, gender and a field code. I need it in a sortable format.

Right now, it looks sort of like this: (sorry it looks weird)

¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦Column 1¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦Column 2
R1¦¦¦¦¦¦¦¦¦¦¦¦Joe Brown¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦27 M U
R5¦¦¦¦¦¦¦¦¦¦¦¦Jane Brown¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦26 F U

etc..for 6000 rows.

I see three problems that need to be solved in order to turn this date into a usable and sortable table:

1. I have to extract the three data types that all ended up in r1c2.
2. I have to transpose the rows into columns.
3. I have to remove all the empty rows.

I'd really rather not do all 159 pages by cutting and pasting manually. Any ideas?
posted by dave78981 to Computers & Internet (14 answers total) 3 users marked this as a favorite
If you still have the PDF, it might be easier to start with that and use a better tool to extract the table data. In the Python world, there are tools such as tabula-py, Camelot, or Excalibur (which is a UI for Camelot).
posted by rockindata at 7:32 PM on February 15, 2020 [1 favorite]

If you have the full version of Acrobat, you can export into an Excel file.
posted by jonathanhughes at 7:35 PM on February 15, 2020 [1 favorite]

Use the Text To Columns functionality under the Data tab, the wizard is pretty self explanatory.
posted by furtive at 8:12 PM on February 15, 2020 [1 favorite]

If the patterns are regular, the 'INDIRECT' function can help you organize the data using formulas (usually combined with row and/or column counters)...
posted by jmfitch at 8:25 PM on February 15, 2020 [1 favorite]

Response by poster: The converter I used (Able2Extract Professional) is actually pretty good. The problem is the layout of the data in the pdf. It's more like 159 pages of business cards rather than one big table.
posted by dave78981 at 9:29 PM on February 15, 2020

Do you have any familiarity with a database engine like Microsoft Access or SQL Server? Either of them would be able to process it. I'm a SQL developer and can help you if you need it. Just memail me.
posted by mezzanayne at 9:48 PM on February 15, 2020

Are the patterns regular? I.e. are there 3 rows for each item with a blank row between them? Are all the blank column cells blank in every row?

If so, exporting the data as a csv and processing it with a scripting language would be relatively straightforward. Take groups of 3 lines emit the wanted data items as consecutive cells of a row, write out a csv file and reopen in excel.
posted by epo at 2:09 AM on February 16, 2020 [1 favorite]

Best answer: The openrefine application may help.
posted by oceano at 2:20 AM on February 16, 2020 [1 favorite]

Look for the mentioned Tabula online - it does have alternative interfaces.

The reality with conversion of PDF to Excel is that nothing happens without great effort. I normally have to use a text editor to modify the CSV file before trying to introduce it to Excel.
posted by yclipse at 4:06 AM on February 16, 2020 [1 favorite]

In Excel the columns have letters, and the rows have numbers. Joe Brown is in A1 and "27 M U" is in B1. You want his address, phone and city/st/zip in row 1 as well. You can just put =A2 in cell C1 =B2 in cell D1 and =A3 in cell E1. Now you have just one row for Joe. You can copy that same formula into the row 5 for Jane using columns C D and E again. You don't have to do that manually if you can filter column 2. If you only have M and F in the columns, you would filter for anything that contains one of those letters. Then you copy cells C1, D1 and E1. Then you highlight the same three columns in the rows that are blank, keeping the filter on. Now Paste the formula in and remove the filter. Now you have every record in one row, but they are all formulas, so you need to copy the C D and E again (the whole column this time) and Paste Special ... Values. Lastly, you want to remove the rows that are blank in column C by filtering and deleting them.

Now you have the value in column B that you want in 3 different columns. Move it to the end so it becomes column E. Then use Text to columns to separate it by the space delimiter. Insert a row at the top and name the columns.
posted by soelo at 5:16 AM on February 16, 2020 [2 favorites]

Response by poster: Thank you all- great ideas!

@johnathanhughes- It's already exported to excel. Now I need to fix the formatting in Excel.

@epo- they are irregular. I went back and reconverted the pdf to remove some of the irregularities but there's still a good amount. I have only minimal Access experience unfortunately.

@oceano- thanks, I just dl'd it and am checking it out. It looks promising...

@soelo- this seems promising too. I'm gonna play around with it today and see where I get.

Thanks everyone for your replies- I've got a bunch of good ideas to explore!
posted by dave78981 at 8:04 AM on February 16, 2020

Depends on the irregularities. Missing data fields? Different numbers of lines per subject. Or as it appears from your example, data in varying columns?

How far would this get?

Output to a csv file.

Read each group of csv records terminated by a blank record and output non blank fields from each line to successive fields of a new csv record. I.e. line, line 2 line3 ... with no gaps. This gets you 1 user record per line.

Read that new csv file back into excel.
posted by epo at 9:02 AM on February 16, 2020 [1 favorite]

Response by poster: Yeah, the data is set up more like contact cards than in rows and columns.
posted by dave78981 at 9:13 AM on February 16, 2020

Response by poster: So openrefine is the answer. 30 minutes playing around and I'm halfway there.

Thanks all for your advice!
posted by dave78981 at 9:54 AM on February 16, 2020

« Older Low-effort vegetarian main course for 10, with...   |   Downsizing: What do I do with my framed artworks? Newer »
This thread is closed to new comments.