Document conversion
October 5, 2014 1:47 AM

How can I turn 23 PDFs of at least 65 pages each, laid out in columns and created in QuarkXPress, into Word documents without the text being turned into nonsense by Word deciding there's only one column when there are sometimes two and a pull quote? Any ideas? (Getting Quark is not an option.)

I saved the documents from PDF (Adobe Acrobat Pro) to Word and this is how the text is showing up in the Word document:
This is column 1 sentence 1. This is column 2 sentence 1. [end para]
This is column 1 sentence 2. Half of pull quote. This is column 2 sentence 2. [end para]
This is column 1 sentence 3. Other half of pull quote. This is column 2 sentence 3. [end para]

This is how it should show for the purposes of the analytical software I will be using.
This is column 1 sentence 1. This is column 1 sentence 2. This is column 1 sentence 3. [end para]
Pull quote. [end para]
This is column 2 sentence 1. This is column 2 sentence 2. This is column 2 sentence 3. [end para]

Except that it's not full sentences, it's only phrases. And it's not mathematically consistent: I can't convert to a table, select all the text from Row 23 onwards, paste it into a column beside the first 22 rows, and then concatenate. Nor can I put all the text in a table and mark it so it sorts first by odd-numbered rows and then by even. And of course it has all the stupid unexpected paragraph marks in the wrong places, and graphics that don't really exist (unless I cut and paste as plain text, and then I lose the advantage of any formatting).

One thing I did try was to cut and paste, page by page, from the PDF. That was somewhat successful, in that it retained the text formatting, which meant I could do things like search for ^p in 14 pt font (headings) and use a placeholder while I took out all the unnecessary paragraph marks. BUT I can't see where paragraphs of body text should fall without comparing page by page, and we're talking about 1500 pages. At 4 minutes a page that's 100 hours! There's no budget for that.

I need this as text as I'm working on a research project with one of the editors that analyses the content of a journal over its entire history. I have asked the editor if he has access to original author submissions but he is unlikely to.
posted by b33j to Technology (8 answers total)
 
Best answer: Have you tried online PDF to Word converters? I use this one and I remember it being a very clean result.
posted by DarlingBri at 2:15 AM on October 5, 2014


Best answer: LibreOffice Writer will open and edit PDFs and can save as MS Word format. Not sure if it will handle the tables better or not, but it's free, so giving it a try is low risk.

XPDF also includes the command-line utilities pdftotext and pdftohtml. You can give these a try and see if they produce something that's easier for you to work with. I see on their website that they have a Windows package that includes these utilities.
posted by duoshao at 3:30 AM on October 5, 2014


Best answer: If you can get the original Quark files, there is a free version of QuarkXPress which should let you select using the original flow of the text. If that doesn't work, ABBYY PDF Transformer+ should handle the multicolumn stuff, but I'm not sure about the pull quotes.

If all the pages are the same (unlikely), and you're familiar with the command line on Linux you can hack something together with pdfcrop and pdftotext. I used this for 1000+ PDFs some years ago with some success.
posted by Baron Humbert von Gikkingen at 3:40 AM on October 5, 2014


Best answer: Basic question: why do you have to convert these PDFs into Word documents?
posted by Brandon Blatcher at 4:29 AM on October 5, 2014 [1 favorite]


Best answer: Have you tried to open the PDF in Word itself? Word 2013 can read PDFs directly.
posted by elgilito at 4:59 AM on October 5, 2014


Best answer: Ah yes; those lovely old packages that produced output for CRT-based phototypesetters did stuff like rendering the page from top to bottom rather than in reading order … As plinth recently pointed out, PDF is merely marks on a page, and any text that our eyes might be able to pull out is a coincidence.

If the columns are predictable, you can script Poppler's pdftotext or pdftohtml to read each column on a page to a separate file using their -x, -y, -W and -H options. Pullquotes and illos might be a bit of a problem, but if you're doing a monitor-corpus style text analysis, you should be fine with a little light editing.
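The column-by-column approach above could be scripted roughly as follows. This is only a sketch under stated assumptions: the column boxes, the file name `issue01.pdf`, and the output naming scheme are made up for illustration, and the real crop values would have to be measured from the actual pages. The only things taken from Poppler are the `pdftotext` options `-f`, `-l`, `-x`, `-y`, `-W` and `-H`, whose crop values are in pixels at the default 72 DPI.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple

# Hypothetical crop boxes (x, y, width, height) for a two-column page.
# These numbers are placeholders, not measurements from the real PDFs.
COLUMN_BOXES: List[Tuple[int, int, int, int]] = [
    (36, 72, 260, 650),   # assumed left column
    (316, 72, 260, 650),  # assumed right column
]


def column_command(pdf: str, page: int, box: Tuple[int, int, int, int],
                   out_txt: str) -> List[str]:
    """Build one pdftotext call that extracts a single crop box from one page."""
    x, y, w, h = box
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),   # first and last page: just this one
        "-x", str(x), "-y", str(y),         # top-left corner of the crop area
        "-W", str(w), "-H", str(h),         # width and height of the crop area
        pdf, out_txt,
    ]


def commands_for_page(pdf: str, page: int,
                      boxes: List[Tuple[int, int, int, int]]) -> List[List[str]]:
    """One command per column, each writing to its own text file."""
    stem = Path(pdf).stem
    return [
        column_command(pdf, page, box, f"{stem}_p{page:03d}_c{i}.txt")
        for i, box in enumerate(boxes, start=1)
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run the commands in order; requires pdftotext to be installed."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(commands_for_page("issue01.pdf", 3, COLUMN_BOXES))` would write the left and right columns of page 3 to separate files, which can then be concatenated in reading order. A pull quote would need its own box, or would have to be edited out by hand afterwards.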

You may lose text styles. Depends how much work you feel like putting in to get them back, or whether the PDF actually encoded that information in a useful manner anyway.

Drop me a link to an example file via memail, and I'll take a look. I've done this sort of thing before.
posted by scruss at 5:49 AM on October 5, 2014


Best answer: If you can't get it to work with technology maybe use one of the freelancer sites to find somebody that will retype it for $5 an hour?
posted by COD at 6:03 AM on October 5, 2014


Response by poster: DarlingBri, thanks, I gave that a go, and it does have some qualities that I appreciate, but I had to extract pages first (as the file was too large) and anywhere there is a pullquote (every page), it's produced a table that I certainly could merge cells on, and then cut and paste to get them in order - which is then a time issue. But good to know about that.

duoshao, Regarding other conversion options, I've tried 4 now, and there are similar problems with each.

Baron Humbert von Gikkingen, I've been informed that the original Quark files are not available.

Brandon Blatcher, it's for the inductive research methodology of grounded theory, through computer-aided analysis of the content (but not the references/headings/editorials/advertisements) of the entire publication, examining themes and concepts and their relationships. So the reason it is going into Word is so that I can be certain that the university's software will be able to read the sentences not only in order, but also discern where paragraphs start and end, and so that I can delete irrelevant material so that it doesn't skew the results through repetitiveness.

elgilito, an interesting thought - unfortunately, only hieroglyphics ensued - but certainly I will keep in my box of tricks for next time.

Thanks for all your help. It's nice to have another expert to go to. I've discussed this with the client/academic, and we're taking a different path now (not relevant to this question).
posted by b33j at 6:39 PM on October 5, 2014

