December 9, 2019 3:01 PM

What OCR program to use on a PDF, to make it searchable, while retaining the format?

I don't mind a reasonable subscription fee if that is what it takes, no-fee preferred of course.

This is for an Appeals Court filing, where the Record on Appeal needs to be searchable, but otherwise match the input PDF formatting.

I have more than a thousand pages in PDF format. There are some JPEG exhibits included in there, and some of those images are of text documents. The First Department Court Clerk said, "run the whole thing through OCR, if it's a JPEG exhibit and won't OCR, then don't worry about it, but do run OCR on the whole thing". I think the result they want is a PDF where you can highlight letters and words, and search for words and phrases.

Some attorneys deliberately mess with you, by printing and re-scanning their filings (maybe with a tiny tilt) to defeat searching. I need to make the text searchable as much as possible. This kind of filing is, after all, 100% 12 point Times New Roman, double spaced, so maybe not too impossible?

I don't mind breaking it up into 50 page chunks, or whatever, but I don't want the program to choke on a JPEG that it can't resolve. Just pass that page through, and move on.

A fast, idiot-proof OCR is what I need. Have a looming deadline, so no time to experiment.
posted by StickyCarpet to Computers & Internet (12 answers total) 8 users marked this as a favorite
GoldFynch is an online service that will do this for you, and you shouldn't have to pay anything to try it out for a single file.
posted by hobu at 3:19 PM on December 9, 2019 [1 favorite]

Best answer: Google Tesseract will turn JPEGs into OCR'd PDFs. It will only do one JPEG to one PDF, so some scripting may be required.
posted by scruss at 3:27 PM on December 9, 2019 [3 favorites]
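The "some scripting" for the one-image-to-one-PDF case could look roughly like the sketch below. It assumes the `tesseract` command-line tool is installed and on the PATH, and that the images sit in one folder; the function names and the `*.jpg` pattern are made up for illustration. Tesseract's `pdf` output mode writes a PDF with the page image plus an invisible, searchable text layer.

```python
import subprocess
from pathlib import Path
from typing import List


def tesseract_pdf_command(image: Path, lang: str = "eng") -> List[str]:
    """Build the Tesseract command that turns one image into one searchable PDF."""
    # Tesseract adds the ".pdf" extension to the output base on its own.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_folder(folder: Path, pattern: str = "*.jpg") -> List[Path]:
    """OCR every matching image in `folder`, producing one PDF per image."""
    produced = []
    for image in sorted(folder.glob(pattern)):
        # check=True stops the loop on the first image Tesseract rejects.
        subprocess.run(tesseract_pdf_command(image), check=True)
        produced.append(image.with_suffix(".pdf"))
    return produced
```

Calling `ocr_folder(Path("exhibits"))` would leave an `exhibit-name.pdf` next to each `exhibit-name.jpg`; merging those single-page PDFs is a separate step.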

I use ABBYY FineReader. It's worth every penny. I use it for thousands of pages a month, such as scanned public domain books off of Archive.org and Google Books that weren't done by those companies to my satisfaction. This is the program you buy, keep updated, and have ready for the next time you need this magic trick.
posted by Mo Nickels at 3:50 PM on December 9, 2019 [2 favorites]

PS: Google Tesseract was nowhere near good enough for professional-level work the last time I tried it. It made errors on something like 40 to 80% of the words it tried.
posted by Mo Nickels at 3:52 PM on December 9, 2019 [1 favorite]

PPS: ABBYY FineReader has done books of more than 1500 pages for me, will straighten crooked pages, will split pages on a spread (so that if you scan pages 102 and 103 as a single image, it will recognize that and then split them into two separate PDF pages in the resulting final PDF doc), recognizes multiple languages, is scriptable, handles images with aplomb by recognizing them as images and getting the text around them, and embeds all scanned OCR text into the final PDF for searching without messing with the quality of the PDF itself. It's a gem. I tried something like 15 programs before I settled on it, including Adobe Acrobat Pro.
posted by Mo Nickels at 3:57 PM on December 9, 2019 [2 favorites]

I second (third?) ABBYY FineReader. It is used for OCR in high-volume environments. It works really well.
posted by ddaavviidd at 5:17 PM on December 9, 2019 [1 favorite]

The full version of Adobe Reader (that is, Adobe Acrobat) does this automatically. Open up the PDF or add the JPEGs you want OCRed into a new document, then go to Tools and 'Edit PDF' and Adobe will auto scan every page and align the whole document. Then just save/resave.
posted by 0bvious at 5:03 AM on December 10, 2019 [1 favorite]

I am a lawyer as well, and we use Nuance Power PDF to do all sorts of things, including this. It's cheaper than Adobe Reader.

In future, the way you save PDF materials can affect whether the text is readable. I have managed to assemble files (with bookmarked sections and index) of about 400 pages digitally without having to run any OCR and having the text be readable. That was with Power PDF as well though.
posted by lookoutbelow at 5:55 AM on December 10, 2019 [1 favorite]

Response by poster: lookoutbelow: I am a lawyer as well

Just FYI, I'm not a lawyer, I'm pro se. Had to get into this litigation thing to protect myself from the bad guys. (I'll let you know if I prevail in this First Department appeal.)
posted by StickyCarpet at 9:58 AM on December 10, 2019 [1 favorite]

FYI Adobe Acrobat Pro will do this, I do it all the time. I don't know how well the current DC version works; it's a monthly subscription and I'm still on Acrobat Pro 11. But it looks like you can get a 7 day free trial, which might be all you need.

FYI it does things like de-tilting any scanned text, and your final PDF looks exactly like what you started with (perhaps straightened a bit), except that it is searchable and copy/pasteable.
posted by flug at 2:54 PM on December 10, 2019 [1 favorite]

Response by poster: I tried the trial version of ABBYY FineReader, but couldn't even find OCR on that menu.

Ended up using Tesseract, which did a very good job, but only on one page at a time.

With bash scripts on Linux, PDFTK was used to split the file into 1,300 separate 600 DPI PNGs. Tried different resolutions, 600 DPI seemed to work the best for me.

Then ran tesseract on each page, via bash script, and merged it all back into one PDF.

Thanks for the help, everyone!
posted by StickyCarpet at 3:07 PM on December 12, 2019 [1 favorite]

Response by poster: Also, PDFTK comes with a graphical front-end, PDFchain. That was the easiest way to merge the single-page PDFs back into one PDF.
posted by StickyCarpet at 3:32 PM on December 12, 2019 [1 favorite]
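For anyone repeating the split, OCR, and merge workflow described above, here is a rough Python sketch of one way to do it. It is not the poster's actual bash script. It assumes `pdftk`, `pdftoppm` (from poppler-utils, used here for the PDF-to-PNG step) and `tesseract` are installed; the `orig-`/`ocr-` file names and all function names are invented for illustration. If either step fails on a page, the original page is passed through untouched, so one unreadable exhibit does not stop the run.

```python
import subprocess
from pathlib import Path
from typing import Callable, List

DPI = 600  # the resolution the poster reported working best


def burst_command(source: Path, workdir: Path) -> List[str]:
    # pdftk writes one single-page PDF per page: orig-0001.pdf, orig-0002.pdf, ...
    return ["pdftk", str(source), "burst", "output", str(workdir / "orig-%04d.pdf")]


def page_commands(original: Path) -> List[List[str]]:
    """Commands that rasterize one single-page PDF and OCR it into ocr-NNNN.pdf."""
    base = original.with_name(original.stem.replace("orig-", "ocr-"))
    png = base.with_suffix(".png")
    return [
        # -singlefile makes pdftoppm write exactly <base>.png
        ["pdftoppm", "-r", str(DPI), "-png", "-singlefile", str(original), str(base)],
        # the "pdf" config makes Tesseract write <base>.pdf with a text layer
        ["tesseract", str(png), str(base), "--dpi", str(DPI), "pdf"],
    ]


def run_quietly(command: List[str]) -> bool:
    """Run one external command and report failure instead of raising."""
    try:
        return subprocess.run(command, capture_output=True).returncode == 0
    except OSError:  # for example, the tool is not installed
        return False


def pages_to_merge(
    originals: List[Path], run: Callable[[List[str]], bool] = run_quietly
) -> List[Path]:
    """OCR each page; fall back to the untouched original page on any failure."""
    merged = []
    for original in originals:
        commands = page_commands(original)
        ok = all(run(command) for command in commands)  # stops at the first failure
        ocr_pdf = Path(commands[-1][2] + ".pdf")
        merged.append(ocr_pdf if ok else original)
    return merged


def merge_command(pages: List[Path], output: Path) -> List[str]:
    return ["pdftk", *[str(page) for page in pages], "cat", "output", str(output)]
```

A run would execute `burst_command(...)` first, then pass the sorted `orig-*.pdf` files to `pages_to_merge`, then execute `merge_command(...)` on the result. Note that every successfully OCR'd page is replaced by a 600 DPI image with a hidden text layer, so the output file can be much larger than the input.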
