Extracting Text from PDF WITHOUT also extracting images
June 12, 2016 2:39 PM

I'm looking for an easy, free way to extract text only from a very large PDF file and create a word document.

The problem I'm having is that the text in the PDF is overlaid over images. I've tried extracting the PDF file to Word format. While it did put the doc in Word format, it included the images and looked very messy. When I try to remove the images, it also removes the text or creates such a mess I have to retype it all again. In addition, Word keeps on crashing because it just can't seem to handle all the crazy formatting.

Is there an easy, low-budget (preferably free) way to extract just the text?
I'm not a super tech-savvy person or a coder, so I'm looking for something easy enough for a five-year-old to do.

Thanks!
posted by firemonkey to Technology (10 answers total) 2 users marked this as a favorite
 
ExtractPDF can pull text only, but you'll lose the formatting (it saves as a txt file). You'd have to then paste the result into a Word file and go through it to re-add formatting, remove all the forced line breaks and so on.
posted by sailoreagle at 2:57 PM on June 12, 2016


Is there a possibility that the words are "burned" into the images? If so, OCR would be the only option, but with images expect poor results. There sure are a lot of PDF tools; perhaps try one that is online.
posted by sammyo at 3:02 PM on June 12, 2016 [1 favorite]


Can you say more about what your final goal is? Do you just need to be able to read this text yourself, or does it have to look presentable for someone else? Do you need to pull the text out so that you can use it as part of another project, or is the end goal just to get this to a place where you can read it and that's enough?

I ask because the difficulty will vary a lot depending on the end use. If you just need to read it, I'd actually try messing around with the accessibility settings in my PDF reader before I tried extracting the text. You can often change the color of the text, and sometimes you can give it an opaque background that might make it readable on top of those images. If you haven't already tried it, that would be my first thought.

If you really need to get it out of the PDF and you need to preserve formatting, that's going to be a lot harder. I don't have an answer there, though someone else might. You could look into deleting the images rather than extracting the text, though—I have a feeling that might be an easier path to your goal.
posted by Anticipation Of A New Lover's Arrival, The at 3:07 PM on June 12, 2016


Response by poster: The end goal is to have a working file for authors to edit in Word. The text files used to create the PDF are long gone. I tried the Free PDF extractor, but the file is too big. I've also tried other online extractors with no luck.
I just have to think there's an easy way to do this. Thanks, everyone, for your great responses so far!
posted by firemonkey at 3:14 PM on June 12, 2016


Best answer: Does select all (from the PDF), copy, and paste into Word work? Then hover over the "paste options" icon and click on "Keep Text Only." I just did a test with a PDF file and Word 2010 - nothing else needed.
posted by eisforcool at 3:34 PM on June 12, 2016 [1 favorite]


Hmmmmmmm. I might actually try converting the PDF in Calibre, which is actually a (free) ebook management program but which is pretty strong for converting between different formats. PDFs are notoriously hard to deal with though, because the format is really designed for printing first and reading second, with editing as a distant afterthought. The placement of the text on the page doesn't necessarily bear any relation to the order in which that text is stored in the file, for instance, and it could very easily be hard-coded into an image or something awful like that.

Word documents aren't exactly known for playing nice with file converters either, so it's a bit of a double whammy. You essentially want to go from the worst format in terms of text conversion to the second worst. (Not that that's your fault; your situation is a depressingly common one.)

Still, what I would try would be to load the PDF into Calibre and see if I could get it into a more flexible format without ruining it. Calibre will at least attempt to convert directly to .docx so give it a shot (and you may be able to tell it to lose the images in the conversion settings dialog) but I have a feeling that a plain .txt file might be your best bet as a first step, since it will strip out all the images and formatting and try to just place the text in some kind of reasonable order. Then you could paste it into Word and reformat manually.

I would probably even just try the old trick where you select all and then copy and paste into a text editor, again dropping all the formatting and images and leaving you with just the bare bones. Then, again, reformat by hand as needed.

Good luck. The situation you're in is one of those ones that is really common and should be easy, but which is likely to be a pain in the ass.
posted by Anticipation Of A New Lover's Arrival, The at 3:43 PM on June 12, 2016 [2 favorites]
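
For anyone comfortable with a command line, the Calibre conversion described above can probably also be run without the GUI: Calibre ships a command-line tool called ebook-convert that picks the output format from the output file's extension. The following is only a rough sketch under that assumption; the helper names and the file name book.pdf are made up for illustration, and results on a PDF with text over images are not guaranteed.

```python
import shutil
import subprocess
from pathlib import Path


def calibre_convert_command(pdf_path, out_suffix=".txt"):
    """Build an ebook-convert command (hypothetical helper).

    The output file sits next to the input and only the extension changes,
    e.g. book.pdf -> book.txt. ebook-convert infers the target format
    from that extension.
    """
    src = Path(pdf_path)
    dst = src.with_suffix(out_suffix)
    return ["ebook-convert", str(src), str(dst)]


def convert_with_calibre(pdf_path, out_suffix=".txt"):
    """Run the conversion, failing early if Calibre's CLI is not on PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    cmd = calibre_convert_command(pdf_path, out_suffix)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[2])
```

Calling convert_with_calibre("book.pdf") would aim for a plain-text file as a first step; passing ".docx" instead asks Calibre to attempt a direct Word conversion.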


Best answer: Highlight all, copy, paste into Notepad, copy that and paste into Word?
posted by penguin pie at 3:57 PM on June 12, 2016 [1 favorite]


pdftk will do it for you. I've only ever used the command line tools, mind you. Actually, I'm wrong: it's xpdf that has a pdftotext utility.
posted by singingfish at 4:25 PM on June 12, 2016
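
A minimal sketch of how the pdftotext utility mentioned above might be driven from a script, assuming pdftotext (from xpdf or a compatible package) is installed and on the PATH. pdftotext writes only the text layer, so images are left behind. The function names and file names here are hypothetical; the -layout option asks the tool to try to keep the original page layout and can be left off.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext; raises CalledProcessError if the tool reports an error."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
```

For example, extract_text("big.pdf", "big.txt") would leave a plain-text file that can then be opened or pasted into Word. If the file comes out empty, the words are probably part of the images and OCR would be needed instead.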


Could you share the PDF file with us? Or at least extract one page from it so that we could take a stab at it? It'd be easier then to see which solution works or doesn't.

Also, do you have a single PDF file to convert once, or is this a regular task with which you are faced?
posted by vert canard at 6:16 PM on June 12, 2016


Forget about Word until the very end. Save the file as plain text and then open in an editor that recognizes and uses "regular expressions" to reformat and fix up the text. It is very likely that you don't know how to do this. Hire someone who does.

Then convert to Word or to another word processing format as needed.
posted by yclipse at 3:40 AM on June 13, 2016
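
To give a sense of what the regular-expression cleanup above could look like, here is a small sketch that removes the forced line breaks typical of text extracted from a PDF. It assumes paragraphs are separated by blank lines in the extracted text, which may not hold for every file, and the function name reflow_text is made up for this example.

```python
import re


def reflow_text(raw):
    """Join hard-wrapped lines into paragraphs (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n"
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split across a line break: "ex-\ntracted" -> "extracted".
    # Caveat: this also fuses genuine hyphenated words that happen to
    # break at a line end ("low-\nbudget" becomes "lowbudget").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines (possibly containing spaces) mark paragraph boundaries
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, turn remaining newlines and runs of
    # whitespace into single spaces
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


sample = "The text was ex-\ntracted from a PDF\nwith forced breaks.\n\nSecond para-\ngraph here.\n"
print(reflow_text(sample))
```

The result is one line per paragraph, which Word will wrap on its own once pasted in. Headings, lists, and tables would still need to be fixed by hand.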

