How to produce an editable PDF doc?
September 26, 2014 12:00 AM

I have a scanned document with some text that I'd like to edit.

I've done my research and tried OpenOffice, LibreOffice, and, finally, a trial version of Adobe Acrobat. I am unable to select the text with any of these programs. I believe my document is being viewed as a picture, because when I click anywhere on the document, it just selects the whole thing.

Then I used CamScanner on my phone to scan the document and uploaded it as a PDF, but I am still only able to select the whole image.

Is there any way in the world for me to convert this document into a PDF with editable text? I think it has something to do with optical character recognition, but I am out of ideas.

Thanks.
posted by massofintuition to Technology (6 answers total) 3 users marked this as a favorite
 
On a Mac, the best software to edit and OCR PDFs is PDFpen Pro.

A paid Evernote account will do it as well.
posted by Mac-Expert at 12:04 AM on September 26, 2014


If it's saved as a PDF right now, can you use your Acrobat Pro trial version to convert it to Word? You could then edit it in Word and save it back to PDF.

Instructions for Acrobat XI here

I have an older version of Acrobat Pro X and this function uses recognition to try to rebuild the document from a scan when the text is an image (and therefore unselectable). I do get some gibberish sometimes, especially when there are borders and tables, but it can be better than recreating a document from scratch.
posted by mochapickle at 4:01 AM on September 26, 2014 [1 favorite]


The last time that I tried this it needed to be scanned as a TIFF and then there was this Microsoft program.... Something ready....
posted by notned at 5:20 AM on September 26, 2014


Latest version of Tesseract outputs PDF with selectable text over the converted image. No OCR is perfect, though.
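
A minimal sketch of that, assuming the `tesseract` command-line tool is installed, is on your PATH, and is recent enough to support the `pdf` output config. The file names and the two helper functions are placeholders made up for illustration:

```python
import shutil
import subprocess


def tesseract_pdf_command(image_path, output_base):
    """Build the argument list for one Tesseract run that emits a PDF.

    Tesseract appends the extension itself, so output_base="scan_ocr"
    produces scan_ocr.pdf.
    """
    return ["tesseract", image_path, output_base, "pdf"]


def run_ocr_to_pdf(image_path, output_base):
    """Run Tesseract if it is installed; raise a clear error otherwise."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(tesseract_pdf_command(image_path, output_base), check=True)
    return output_base + ".pdf"
```

Calling `run_ocr_to_pdf("scan.png", "scan_ocr")` should write `scan_ocr.pdf`, with the scanned image still visible and the recognized text layered over it.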
posted by scruss at 5:31 AM on September 26, 2014


Just a lot of background on PDF and OCR to set your expectations -

PDF is a file format that is made to be able to represent marks on a page.
It dictates neither how those marks are made nor whether or not they carry any meaning.

In other words, from a visual standpoint, you may not be able to tell the difference between a page with the character Ä, a page with an A with two periods over the top, a set of Bezier curves that look identical to an A with an umlaut, or a scanned image of an A with an umlaut.

In your case, you have a scanned document which means that you took a picture of text. The shortest path to PDF is to make a single page with a single image placed on that page. While not trivial, this is at least straightforward to do.

If you want to have honest to dog selectable text on the page (and trivia note here, I wrote the text selection code that ended up in the original version of Acrobat - this is also a non-trivial problem especially with maps - and it took me 3 months to do), you have to place honest to dog text and not pictures of text.

So in comes OCR (optical character recognition). This is a process that has been around for quite a while and it stinks. It looks at an arbitrary picture, tries to segment it into characters, tries to match the characters to knowns, and then tries to stitch them together into words. For OCR software that's not very good at recognizing characters, there is invariably another step that uses trigrams and dictionaries to improve the reliability of the recognition (which also, due to keming, may turn the sequence gems into penis).
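
To make the dictionary step concrete, here is a toy sketch. It is not how any particular engine works; the confusion table and the tiny word list are invented for illustration, and real engines use far larger language models:

```python
# Glyph sequences a recognizer commonly mixes up (illustrative, not exhaustive).
CONFUSIONS = [("m", "rn"), ("rn", "m"), ("1", "l"), ("0", "o")]

# Stand-in for a real dictionary.
WORDS = {"kerning", "gems", "modern", "hello"}


def variants(word):
    """Yield every spelling reachable by swapping one confusable sequence."""
    for seen, meant in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            yield word[:start] + meant + word[start + len(seen):]
            start = word.find(seen, start + 1)


def correct(word, dictionary=WORDS):
    """Return the word unchanged if known, else the first known variant."""
    if word in dictionary:
        return word
    for candidate in variants(word):
        if candidate in dictionary:
            return candidate
    return word  # no confident fix; leave it for a human
```

Here `correct("keming")` returns `"kerning"`, which is the helpful case. But `correct("modem")` returns `"modern"`, because "modem" is missing from the toy word list: the same step that repairs some words quietly breaks others.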

In PDF land, there's this glorious hack available in text placement. You can place text that is invisible. There are 3 bits of information that control how text is drawn: draw the outline (or not), fill the text (or not), clip to the text (or not). If you have all three set to "or not", the text is placed but is invisible and still selectable. This makes it ideal for an OCR engine to add it on top of an existing image without affecting the rest of the page, and in practice is what most OCR engines do when generating/working from a PDF document.
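
For the curious, those three on/off choices are encoded in the page's content stream as a single text rendering mode, set with the `Tr` operator, and mode 3 is the all-"or not" case. A rough sketch of the mapping follows; the function names, the font resource name `/F1`, and the coordinates are chosen only for illustration, and it assumes text the font's encoding can represent:

```python
def text_rendering_mode(fill, stroke, clip):
    """Map the three on/off choices to the PDF text rendering mode (0-7)."""
    base = {
        (True, False): 0,   # fill only (ordinary visible text)
        (False, True): 1,   # outline only
        (True, True): 2,    # fill, then outline
        (False, False): 3,  # neither: invisible
    }[(fill, stroke)]
    return base + (4 if clip else 0)  # modes 4-7 also add the text to the clip path


def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, size=12, font="/F1"):
    """Content-stream operators that place selectable but invisible text."""
    mode = text_rendering_mode(fill=False, stroke=False, clip=False)
    return (
        f"BT {font} {size} Tf {mode} Tr {x} {y} Td "
        f"({escape_pdf_string(text)}) Tj ET"
    )
```

For example, `invisible_text_ops("Hello", 72, 720)` gives `BT /F1 12 Tf 3 Tr 72 720 Td (Hello) Tj ET`, which is roughly the kind of fragment an OCR engine lays over the scanned image.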

So when you OCR a PDF document into another PDF document, you are as likely as not to still have that picture there as well as text added.

That brings us to editing.

PDF is not a simple-to-edit format. It wasn't designed to be and it's hard to explain this to users.

Users open a PDF and think, "hey - this looks just like a WYSIWYG editor! I should be able to just start typing!" and nothing could be farther from the truth.

Most WYSIWYG editors have a huge advantage over PDF in that they only let you do very carefully arranged subsets of the whole "marks on a page" thing. Editing PDF - that's a whole other thing entirely. Getting Illustrator to import PDF was hard enough and Illustrator shares a similar model of operation to Acrobat.

Editing your OCR'ed PDF document is going to be especially entertaining because there are two representations of the text on the page - a pretty picture from your scanner/camera and invisible text from the OCR engine. It's effectively not editable.

Here's what you really want to edit the content: text extraction.

Use a text extraction tool to pull out all the OCR'ed text or use the OCR engine to generate a text file instead of PDF. Put that into your favorite editor and Bob's your uncle.
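
One way to do the extraction, sketched under the assumption that the `pdftotext` utility from Poppler is installed and on your PATH. The file names are placeholders and the helper functions are hypothetical:

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, text_path, keep_layout=True):
    """Build the argument list for extracting a PDF's text layer."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to preserve the page's physical layout
    return command + [pdf_path, text_path]


def extract_text(pdf_path, text_path, keep_layout=True):
    """Write whatever text the PDF contains to text_path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, text_path, keep_layout), check=True)
    return text_path
```

Something like `extract_text("scan_ocr.pdf", "scan.txt")` should leave you with a plain text file to open in your editor. If the PDF is still just a picture with no OCR text on it, expect an empty or nearly empty file.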

Unfortunately, Bob's also your creepy uncle who substitutes 'penis' in for other words just to see if you're paying attention. Bob makes errors in about 5-20% of the words on each page, depending on the content. If it's a document with a great deal of jargon in it (for example, my whole screed here), you're in for a lot of work fixing up Creepy Uncle Bob's mistakes.

Tesseract is an OCR engine (IIRC) from the 90's, originally developed at HP, and it worked OK. It was open-sourced and taken over and has been improved somewhat.

If you compare it to commercial OCR engines, it's pretty awful. ABBYY is among the better ones in terms of document segmentation (figuring out what's a picture and what's text). Transym is one of the most accurate OCR engines and if your document looks a lot like a typical PhD thesis, it will do very well. I.R.I.S. is decent enough - does some nice things with tables. Nuance (Scansoft), last I checked, was a very old engine of the same vintage as Tesseract that has been maintained through the years.

I work a lot with OCR engines (for tool integration), but I very rarely work with these companies' end-user products.
posted by plinth at 6:47 AM on September 26, 2014 [3 favorites]


You just want the text without having to type the text? Google can OCR a JPG, PNG, GIF or PDF for you
http://computers.tutsplus.com/tutorials/how-to-ocr-documents-for-free-in-google-drive--cms-20460
posted by Ness at 7:41 AM on September 26, 2014


This thread is closed to new comments.