Adobe and the inflexible acrobat
April 10, 2008 9:30 AM

I have a few pdfs of tabular data that I would like to use in a spreadsheet and/or database. Acrobat Reader is not cooperating; it converts obvious tabs into spaces. Is there some way to get this data, properly formatted?

I have 3 pdfs of 8 columns of data, roughly 100+ pages apiece. The data is clearly in a column format, but copy + paste converts the tabs into spaces, regardless of whether it comes from a browser plugin or from Acrobat Reader itself, and regardless of which program it goes into.

Exporting as txt also converts the tabs into spaces, in addition to getting the column order wrong on any row of text with line breaks in it. (something like "1 / 2, 2, 2 / 3" comes out as "2 / 1 / 2 / 3 / 2")

If I copy and paste, I cannot do a global replace to convert spaces to tabs (and then a series of replaces to convert double tabs back into single tabs), because many of the columns contain spaces within the data itself.

I've looked online and found a number of shareware tools claiming to convert .pdfs to .xls, but I don't know how successful or trustworthy they are, and I'm hesitant to install anything unknown/untrusted onto workspace computers.

Tech Services here at work could not figure out how it's done.

I'm not convinced there isn't a workable solution. Do any of you have experience with this problem? Is it possible without buying new software?
posted by johnofjack to Computers & Internet (44 answers total)
Can you post one of the pdf files somewhere we could play with it?
posted by flabdablet at 9:45 AM on April 10, 2008

I've never tried this from a pdf, but if the data is "clearly in a column format" can you try fixed width columns, rather than tab delimited?
posted by jermsplan at 10:50 AM on April 10, 2008
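A minimal sketch of that fixed-width idea with cut(1). The sample line, the field layout, and the column ranges below are all fabricated for illustration, not measured from the real reports:

```shell
# Fixed-width extraction: if every field starts at a known character column,
# slicing by position recovers fields even when they contain spaces.
# Layout here (quiz in cols 1-8, title in 9-28, author in 29-44) is made up.
line=$(printf '%-8s%-20s%-16s' '#9340' 'Snow Joe' 'Greene, Carol')

# Slice one column range out of the line and trim trailing padding.
field() { printf '%s\n' "$line" | cut -c"$1" | sed 's/ *$//'; }

field 1-8     # -> #9340
field 9-28    # -> Snow Joe   (the internal space survives intact)
field 29-44   # -> Greene, Carol
```

The catch, as the rest of the thread shows, is that this only works if the columns really do start at the same positions on every page.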

There is no easy way, but I think you are on the right track. Replace the spaces with commas and then manually edit the ones that come out wrong. Import the file into Excel as a CSV and you should be good; most database programs can read CSV files easily. If you have columns whose text data contains spaces, concatenate those cells afterwards.
posted by JJ86 at 11:38 AM on April 10, 2008

BTW, once you have the raw data pulled out of the PDF any database guru can write scripts to reassemble the data the way it should be. Try to see if Tech Services has one of them.
posted by JJ86 at 11:40 AM on April 10, 2008

The only thing that makes this impractical, AFAIK, is if you have more than one space inside your data fields. Otherwise just replace 3 spaces with 2 a few times and then replace all occurrences of two spaces w/a tab.

Beyond that, if I saw some (fake) data, I could use regex S/R to handle it w/relative ease in an editor like notepad++ or perl.
posted by prodevel at 11:56 AM on April 10, 2008
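The iterated replace described above can also be collapsed into a single pass: turn every run of two or more spaces into one tab, leaving the single spaces inside titles and names alone. A sketch, assuming GNU sed (which understands \t in the replacement) and a fabricated sample line:

```shell
# Collapse every run of two or more spaces into a single tab, in one pass.
# Single spaces (inside "Snow Joe" and "Greene, Carol") are left untouched.
# Assumes GNU sed; the sample line is fake.
printf '%s\n' '#9340   EN   Snow Joe   Greene, Carol' | sed -E 's/ {2,}/\t/g'
```

This is exactly equivalent to repeatedly shrinking runs of spaces and then converting pairs, just done in one regex.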

Response by poster: There is more than one space in several of the data fields.

Manually editing the data is not a workable solution, since there are 3 100+ page pdfs of several columns of data.

Regex hadn't occurred to me, but it might be a solution; maybe one of the computer gurus here will know.

I'll post one of the pdfs when I get home in a few hours.
posted by johnofjack at 12:40 PM on April 10, 2008

Response by poster: Ok, nevermind; I uploaded it to and set it to shared:

flabdablet, if you (or anyone else!) would like to take a look at it to see what I'm talking about and the problems faced, that would be great.
posted by johnofjack at 1:02 PM on April 10, 2008

Best answer: Well, it would be ten times easier if you had access to the report generator that spews out this mess, which I suspect you don't. If you did, you could widen the title/author fields and 'print' in landscape so those fields wouldn't word-wrap - or skip pdf/print entirely if there's an export feature.

Programmatically, there would be almost no way to know whether the wrapped text came from the title field, the author field, or both. You might measure the length of both fields before scanning the next wrapped line and base the decision on that, but even then, in the case where both title and author have wrapped, deciding which words belong to which field would be near impossible unless you were slick enough to work out the rules about the wrapping.

my $.02 anyway. Someone may have the parse-fu.
posted by prodevel at 1:58 PM on April 10, 2008

pdftotext -layout Glen\ Spring.pdf glen.txt

Gets you the data (it's formatted more neatly than it appears here):

Quiz No. Lang. Title Author IL BL Points Count F/NF
#9340 EN Snow Joe Greene, Carol LG 0.3 0.5 59 Fiction
#36573 EN Big Egg Coxe, Molly LG 0.4 0.5 99 Fiction
#9306 EN Bugs! McKissack, Patricia LG 0.4 0.5 69 Fiction
#41850 EN Clifford Makes a Friend Bridwell, Norman LG 0.4 0.5 67 Fiction
#42156 EN I Am Lost! Wilhelm, Hans LG 0.4 0.5 64 Fiction
#31586 EN Oh, Cats! Buck, Nola LG 0.4 0.5 94 Fiction
posted by stuartmm at 1:59 PM on April 10, 2008

Well the author field is almost always 2 words except on the rare occasion when there is an initial (ignoring the hyphens which will not be affected). Work from right to left on the concatenate and just drop/ignore the initials where you run across them. Once you get the author it should be pretty easy to write a concatenate equation to splice together the title. I assume you know your way around Excel enough to handle that?

It is an exercise in frustration but seems doable within a week, maybe :)
posted by JJ86 at 2:09 PM on April 10, 2008

There are only a few names which break the mold such as "Salzmann, Mary Elizabeth". Once you get the majority of everything else taken care of you can go back and search for those names to replace them. Ideally you could find out those names when you are putting together the CSV file and replace them correctly after the commas are inserted.
posted by JJ86 at 2:13 PM on April 10, 2008

There seem to be about 375 unique lines that have folded, for what it's worth.

(Creative Ed.)
Abenaki Legend
About the Human Body
Achilles, The
African Tale

Wrinkled Knees (Abridged)
Young Readers)
Zipporah Feldman, a Jewish Immigrant
posted by prodevel at 2:47 PM on April 10, 2008

I hate to advocate here, but they would have a field day with this kind of info, probably competing to post the most efficient code to handle it. They're the best around...
posted by prodevel at 7:29 PM on April 10, 2008

I've spent a couple hours fartarsing around with a script that attempts to parse the output from pdftotext -layout, and I've come to the conclusion that I don't really want to spend any more time on this. It's actually quite annoying stuff to parse, because it's only about 95% regular. There are occasional fields that don't line up with those above or below, the columns are in different positions on different pages, sometimes the titles are split as well as the authors, there aren't always two spaces at the end of the title or author fields, and on and on and on... pdftotext damages the layout just enough to make things quite difficult.

If this is a one-off exercise and the three PDFs are absolutely all you've got to work with, you'd probably be better off feeding them through a decent OCR program (or even a crap one - Microsoft Office Document Imaging might do), having it recreate tabular text for you, and then fixing at least some of it up by hand. If it isn't a one-off, and you're trying to build these reports into some kind of ongoing workflow - then all I can really offer is my sympathy.

This is, unfortunately, a fairly typical result for an attempted printout-scrape. They don't often work well. If there is any way at all that you can get access to the data that gave rise to these PDFs, doing so will save you absolute bags of time.
posted by flabdablet at 6:11 AM on April 11, 2008

By the way: when you copy/paste this stuff, there are no tabs being converted into spaces. What you're seeing is pixel-positioned printed text (which has pretty much completely lost any notion of internal relationships) being re-ordered to approximate the way it looks on screen. This is also what tools like pdftotext -layout are trying to do, and it really doesn't work very well at all. I wouldn't expect anything good to come of time spent trying out PDF to XLS converters on this source material.
posted by flabdablet at 6:16 AM on April 11, 2008

Just to illustrate that point: here is a file containing all the string literals from the PostScript code corresponding to your PDF (it's the output of

pdftops Glen\ Spring.pdf - | sed -n '/(/s/[^(]*\(([^)]*)\).*$/\1/p'

in case you care about that kind of thing).

As you can see, the ordering and grouping of the items that get sprayed onto the page does not have a whole lot to do with what is related to which, and many of the fields have been broken into parts for printing.
posted by flabdablet at 7:34 AM on April 11, 2008

Response by poster: On my home computer I tried pdftotext and a couple of other freeware titles, but to no good result.

Unfortunately I don't have access to the original reports in a different format, though I can see how regenerating them as something else, even HTML, would be much better.

Last night it occurred to me that maybe I could open these pdfs in other Adobe programs. Pagemaker wouldn't open them and Photoshop rasterized them. So neither was a solution.

OCR is another possibility that hadn't occurred to me; I think I'll give it a shot next.

Thanks for your input, everyone.
posted by johnofjack at 10:04 AM on April 11, 2008

A quick Google made it look like having Adobe Acrobat (not just the Reader) might make this a trivial exercise; I don't know if your company has a copy, or would be willing to get one for this. Admittedly, the links I saw all described older versions of Acrobat, though I'd imagine the functionality still exists in the newer versions.

Other than that, I agree with JJ86 and flabdablet that doing this one time is going to be annoying, and setting up a general repeatable job is probably impossible if you anticipate doing this again in the future.

A couple hours of playing with the data in a database showed me that you can get 99% of the way there without too much trouble, but at some point you're going to have to do some manual adjustments to account for authors without commas or authors with Jr.s (at least, those were the corner cases that kept me from wanting to find a complete general solution.)

If you'd like to see the SQL I used to get most of the way there, I can post that.
posted by jermsplan at 12:24 PM on April 11, 2008

Well w/help I was able to find a tool here (Adobe) but I don't have anywhere I can put the file so that Adobe can look at it properly. Unfortunately, I assume the output will be similar to pdftotext, but hopefully the html version might not fold.
posted by prodevel at 4:12 PM on April 11, 2008

Response by poster: prodevel, that Adobe link was a great idea but unfortunately the tool converts big chunks of the file into images and Adobe doesn't even make them available, so you get about 3 lines from each page followed by a big broken image. (Amazing, really.)

I'll ask tomorrow if we have a full copy of Acrobat; probably either Systems or the Art and/or PR department does. The OCR option's on the back burner for now; I can't help wondering if correcting typos would be as time-consuming as correcting incorrect line breaks. The OCR program I have is really not very good.

jermsplan, thanks for the offer of the SQL. I'm hoping I can do the conversions through Acrobat itself, but I'd be interested in seeing the query just to ponder it a bit and get my level .5 SQL up to 1 or so.
posted by johnofjack at 9:02 PM on April 11, 2008

Best answer: It seemed a shame to completely waste the time I'd already spent on this, so perhaps you can use what I've already done.

Glen Spring.txt has tab-separated fields and is suitable for spreadsheet import (I tried it with Calc, and it seems to work OK). Looks to me like it's done a reasonable job of collecting 4771 out of the 4871 quiz results in Glen Spring.pdf.

Glen Spring.discards.txt has all the output lines from pdftotext that didn't make it into Glen Spring.txt. You can cut and paste the last hundred quiz results from that by hand, if you care.

If you put all three of your source PDFs into a folder with the "parse" script, then cd into that folder with Terminal and type

./parse *.pdf

you should be able to recreate the text files above, along with corresponding ones for the remaining PDFs. I'd be interested to know how you get on.
posted by flabdablet at 4:14 AM on April 12, 2008

Response by poster: flabdablet, that sounds great. I'd thought of using OpenOffice for it but I've got some setting toggled wrong; it keeps dumping them as an indecipherable mess into Office Writer instead of Calc. I'll have to look into that; it's heartening to know it works.

Thanks for all your work on this; I do appreciate it.
posted by johnofjack at 10:31 AM on April 12, 2008

I've noodled around with this in Perl, & gotten as far as:
/9340/EN/Snow Joe/Greene, Carol/LG/0.3/0.5/59/Fiction
/36573/EN/Big Egg/Coxe, Molly/LG/0.4/0.5/99/Fiction
/9306/EN/Bugs!/McKissack, Patricia/LG/0.4/0.5/69/Fiction
/41850/EN/Clifford Makes a Friend/Bridwell, Norman/LG/0.4/0.5/67/Fiction
/42156/EN/I Am Lost!/Wilhelm, Hans/LG/0.4/0.5/64/Fiction
... 4766 lines of that sort of thing. Leaving only, um, 1652 lines of header & unparsed stats.

I haven't caught the word wrapped / folded lines yet.

Regex so far:
    m/^(\s*\#*\d+\s*)   # quiz - 1
      (\w{2}\s*)        # lang - 2
      (\S+(\s\S+)*\s*)  # title - 3
      (\S+(\s\S+)*\s*)  # author - 5
      (\w{2}\s*)        # IL - 7
      (\d\.\d\s*)       # BL - 8
      (\d\.\d\s*)       # points - 9
      ([0-9,]+\s*)      # words - 10
      (\S+\s*)$         # fiction - 11
     /x)
I pull the data out later; this pulls out the columns, to properly place the folded lines later. ... I feel like apologizing for the bad form in not getting the data out right here, but it's complicated enough for me right now...

On preview: um.
posted by Pronoiac at 11:07 AM on April 12, 2008

I just updated my parse script - it now does a much better job, leaving only four results behind in the discards file. There are a couple of oddities in the resulting Author field, where the last word of a title is stuck onto the author's surname because there were no spaces at all between them in the pdftotext output, but I don't propose to fix those.

I would be very interested to find out how well it does on the other two PDFs.
posted by flabdablet at 4:32 PM on April 12, 2008

To force OOo Calc to import a text file as spreadsheet data instead of handing it off to OOo Writer, use Insert->Sheet from File instead of File->Open.
posted by flabdablet at 4:35 PM on April 12, 2008

Having just twigged that the # mark in front of some of the quiz numbers actually means something (recorded voice quiz), I've updated the script to pull those out as a separate column. I've also made it produce a CSV output file instead of a tab-separated one, which should make for smoother spreadsheet imports.
posted by flabdablet at 6:31 PM on April 12, 2008

Very cool, flabdablet :)

I've always used korn/sed/perl - these lines aren't commented?

re_ext_title="^ {17,19}($re_words)\$"
re_ext_author="^ {62,71}($re_words)\$"
re_ext_title_author="^ {17,19}($re_words) +($re_words)\$"

interested in the numbers...
posted by prodevel at 7:47 PM on April 12, 2008

"^ {17,19}($re_words)\$" means: start of line (^), then 17 to 19 spaces, then the expansion of $re_words (in parentheses so whatever it matches can be picked up later from $BASH_REMATCH) then end of line ($). The end-of-line $ is escaped with a \ to stop bash trying to interpret it as the start of a parameter expansion.

The numbers reflect the range of positions I found things starting in when examining the pdftotext output.
posted by flabdablet at 8:42 PM on April 12, 2008
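A runnable sketch of those mechanics: the space-count ranges are the ones quoted above, while the classify helper and the sample lines are my own invention, not part of the actual parse script:

```shell
# Leading-space counts decide whether a folded line is a title or an author
# continuation. The {17,19} and {62,71} ranges mirror the quoted script;
# classify() and the test lines below are fabricated for illustration.
re_words='[^ ]+( [^ ]+)*'
re_ext_title="^ {17,19}($re_words)\$"
re_ext_author="^ {62,71}($re_words)\$"

classify() {
  if [[ $1 =~ $re_ext_author ]]; then
    echo "author continuation: ${BASH_REMATCH[1]}"
  elif [[ $1 =~ $re_ext_title ]]; then
    echo "title continuation: ${BASH_REMATCH[1]}"
  else
    echo "no match"
  fi
}

classify "$(printf '%17s' '')and Other Noodle Tales"   # -> title continuation
classify "$(printf '%65s' '')Greene, Carol"            # -> author continuation
```

Note that the {17,19} quantifier can't accidentally swallow the start of a 62-space author line, because the character right after the 19th space would still be a space, so the title pattern fails there.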

I've been away from the biz for a while - your post reminded me of a bunch of favorite hp/solaris commands like cut/paste and strings. Thanks
posted by prodevel at 10:38 PM on April 12, 2008

Sorry I didn't get back here sooner. Anyone who would like to see what the SQL I wrote looks like can see it here:

It lost most of my pretty formatting when I imported it, but you probably want to put the code into an editor that will color it for you anyway.
posted by jermsplan at 6:58 AM on April 15, 2008

Response by poster: Sorry for the delay; I was off Sunday and yesterday.

Silly question: how do you do this part about CDing into the directory in terminal? I'm using WinXP at work and having some trouble with both finding the terminal (not HyperTerminal) and getting "run" to do anything other than open the folder.

And I've updated my profile just now, because I realized "programming" might seem to refer to PHP, Perl, etc. when what I meant was "programs for the public to attend at the library." I'm okay with software, not very good at all with computer languages.
posted by johnofjack at 10:57 AM on April 15, 2008

Oh Jesus, you're on Windows. Sorry - I had kind of assumed you were using something civilized.

Hang on for an hour while I translate that thing to VBS.
posted by flabdablet at 7:01 PM on April 15, 2008

OK. Unzip parse.vbs and pdftotext.exe into the same folder where your PDF files are, then double-click on parse.vbs. It should run completely silently and just make a bunch of files. For each .pdf, you should get a corresponding .txt, .csv and .discards.txt.

I'd love to know how you get on.
posted by flabdablet at 11:01 PM on April 15, 2008

Response by poster: That script worked beautifully for Glen Springs, but for the pdfs from the other two schools (sent years ago), it produced a list of titles exactly the same size as the list of discards, with a 0-byte csv file for each.

The distinction between tab-spacing and pixel-precise placing is an important one; I wish that Adobe would expand its Reader capabilities to include more useful export options (though I can see how they might want people to pay up for those features).

flabdablet, you are a superstar for working on this; you've gone above and beyond. I appreciate all your work; it's much more than I could have asked for. I think maybe the conclusion to all this, though, is that at this point Adobe Reader just does not handle exporting text as well as it should.
posted by johnofjack at 12:59 PM on April 16, 2008

Best answer: OK. This has gone beyond a joke. The non-cooperation of the other two PDFs is now a personal affront :-)

If you could give me links to the other two PDFs, I will have a look at them, and if they're not too far removed, format-wise, from the Glen Springs one, I should be able to adapt the script to parse them properly.

Look, it's not really Adobe Reader's fault. PDF is a page description format, not a data interchange format. The whole point of PDF is to give you a way to make something that looks the same regardless of what you're viewing it on. It's absolutely not designed to preserve logical relationships between the various items that appear on the page.

That's why PDF editors are so thin on the ground. PDF is simply not a useful format for data that needs re-interpreting, even to the extent of fairly simple edits. It's an output format, not a data format. The best way to think of PDF is as electric paper, and the best way to think of extracting meaningful information from a PDF is to consider it one tiny step removed from scanning a printout. It's absolutely not something you'd do if you had a better source available for the same data.

Frankly, I am truly impressed by how good a job pdftotext does in extracting even the first approximation to meaningful information from Glen Spring.pdf. I would be surprised to find a general-purpose export tool that does better.
posted by flabdablet at 4:18 PM on April 16, 2008

Response by poster: Thanks for explaining that, flabdablet; it makes a lot of sense now that Acrobat isn't handling it so well.

The other two pdfs are at and

FWIW, your script's handling of pdftotext's output on the first file does a much better job of parsing the file than the full version of Acrobat does.
posted by johnofjack at 5:57 PM on April 16, 2008

Yeah, well, I should have been able to adapt the script to parse them properly... but they're clearly from a different version of the report generator (some columns missing, some re-ordered) and pdftotext isn't doing anywhere near as nice a job converting them to text. It's folding and breaking lines in some very odd ways indeed. I can't really blame it, though - a quick look at the embedded PostScript strings (with the same technique I used above) reveals that every single text character is its own PostScript string and that bits of the author and bits of the title are all mooshed in together. You really can't get text much more sliced and diced than that.

I'm currently looking at the source code for pdftotext to see if I can tweak it to do what I want. I'm thinking that if I convince it that the font is really tiny, or the page is really wide, something good might happen. But I don't hold out much hope.
posted by flabdablet at 3:54 AM on April 17, 2008

Having played with this a bit, I've come to the conclusion that there is less work for you to do in cleaning up the .txt versions of the last two PDFs than there is for me in fartarsing about any more with the parser.

What you need to do is work through both text files, unfolding folded lines and unsmooshing titles from authors, and making sure that there are at least two spaces between the various fields in each line. At first glance, it looks to me like most lines are OK - I think there are only a few hundred that will want editing. You should also make sure that there is something non-blank in every data column (there are a few lines in "hidden oaks arlist-title.txt" that seem to be missing an "Int. Level").

Once that's done, replacing all runs of two or more spaces with a single tab and deleting all the header lines should get you a file you can import into your favourite spreadsheet.

It's painstaking and tedious work. If it wasn't, I'd do it for you :-)
posted by flabdablet at 9:24 PM on April 17, 2008
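The mechanical tail of that recipe - dropping header and blank lines, then converting the runs of spaces to tabs - might look like this. The "Quiz No." header pattern is a guess based on the sample output shown earlier in the thread, and GNU sed is assumed:

```shell
# After the folded lines have been fixed by hand: delete blank lines and
# header lines, then turn each run of two or more spaces into a single tab.
# The "Quiz No." pattern is guessed from the sample shown earlier.
to_tsv() {
  grep -v -e '^[[:space:]]*$' -e '^Quiz No\.' | sed -E 's/ {2,}/\t/g'
}
# e.g.  to_tsv < "hidden oaks arlist-title.txt" > import.tsv
```

The output is tab-separated text suitable for a spreadsheet import.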

flabdablet - Upgrade your pdftotext, maybe? Going from poppler 0.3 to 0.6 removed lots of extraneous spaces for me.
posted by Pronoiac at 9:48 PM on April 17, 2008

Pronoiac, the pdftotext version I'm working with is the current one available at foolabs. Does whatever you're using give you non-smooshed titles and authors from these two PDFs? I think stuff like
    17301    All of Our Noses Are Here and Other NoodleSchwartz                                                  Alvin Tales                    LG            2.4     0.5    English     Fiction
is going to prove more trouble to parse than it's worth.
posted by flabdablet at 11:22 PM on April 17, 2008

Ah. I glanced at the newer version, & I compared my 0.3 & 0.6 output, without comparing either to what you got. It fixed stuff I hadn't noticed - like "S p l a s h !" - but not what you pointed out.

That specific line you quoted breaks because "Noodle Tales" gets hidden behind "Alvin Schwartz". (PDF viewer Evince does the same thing.) "pdftotext -raw" gets the order right - "17301 All of Our Noses Are Here and Other Noodle TalesAlvin Schwartz LG 2.4 0.5" - but loses the spacing of the layout. It looks like Glen Spring gets the # at the end of the line, but the other two look like they cough up the data readily.

It looks like the second line usually has the author's first name. Extracting info from both takes looks like it would work. Though I can't volunteer for this, at least not tonight.
posted by Pronoiac at 12:06 AM on April 18, 2008

I had started fooling around with that kind of thing (making a list of strings from pdftotext -raw and turning them into regexes for a new parser) and then I thought to myself: it took four hours just to make the original bash parser work in VBS; it's going to take many more than four hours to get this thing to work; it's going to take johnofjack far less than four hours to hand-edit the text files he already has, fix up the title vs. author smooshing, put two spaces between fields, delete header lines and replace multiple spaces with tabs.

If somebody was actually foolish enough to pay me to do this, and there were twenty or more of these things to deal with, then sure - full steam ahead. But I think we're well and truly into diminishing returns here.
posted by flabdablet at 1:44 AM on April 18, 2008

Response by poster: Right. I'd just like to thank you all again for your work and to recommend that you stop. :-) You've done far more than I could have asked for, especially for free, and rather graciously at that.

I think the point to be taken from all this is that PDF is not the file format you need when bringing things into a spreadsheet, and that if that's what you're stuck with then your best bet is to try to get the file regenerated as .xls or .csv.

The second-best bet is pdftotext plus a ton of ingenuity and coding, or hours of spare time and a saint's worth of patience.

Thanks again. I'll work it out; please don't worry about it any more.
posted by johnofjack at 9:27 AM on April 18, 2008

I'm not thinking about this too much. Honestly. I've just mulled this over a bit. Flabdablet's kung fu is the best.

My guess on the quickest way to do this:
* Have two scripts, one for each pdftotext output format. Each attempts a simple regex match & coughs up a guess.
* Diff those guesses, & include the common lines untouched.
* Edit that diff, correcting, combining, & removing the bad guesses.

posted by Pronoiac at 12:56 PM on April 26, 2008
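That diff-the-guesses step can be sketched with comm(1). The two "guess" files here are tiny fakes standing in for the output of the two hypothetical parser scripts:

```shell
# Keep only the lines both parsers agree on; everything else goes off for
# hand-editing. The guess data is a fabricated stand-in for real output.
printf '9340,EN,Snow Joe\n9999,EN,Bad Guess A\n' | sort > g1.sorted
printf '9340,EN,Snow Joe\n9999,EN,Bad Guess B\n' | sort > g2.sorted
comm -12 g1.sorted g2.sorted                  # agreed:  9340,EN,Snow Joe
comm -3  g1.sorted g2.sorted > disputed.txt   # lines to correct by hand
```

comm requires sorted input, hence the sort; -12 suppresses the lines unique to either file, leaving only the agreed ones, and -3 does the opposite.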
