How can I nuke bad OCR text in a lot of PDFs and replace with good?
May 11, 2016 8:18 AM

I've inherited a ton of PDFs of scanned documents that are somewhat readable. The sources were actual paper documents run through a scanner. Now I'm trying to make the OCR'd text readable by screen readers in one way or another. Looking for advice or even keywords to Google. Situational specs ahoy!

So I have a bunch of PDFs that I'm trying to make readable by screen readers. Rescanning the documents is not an option.

Here's what I have:

One Zillion PDFs. I have to take them one or two at a time, which I can do. I'll do them as they come, as long as I can set up a repeatable process.

Word 2007.

Photoshop CS 5.1

Adobe Acrobat X Version 10.1.16

Here's what I've been doing:

Opening the PDF, saving it as a new document.

When I use the tools to OCR the dickens out of it, I get all kinds of crap:

Wh,y you allould newer refUse c!ooaert. In Flnlanc!...

Which is not a big deal. I can fix the text. In theory. I just can't get it back into the document. When I look at the document within the Order pane (reading order), each page has a little "box" under it with the bad text. I can clear the page structure, but when I draw on the page with the TouchUp Reading Order tool to put in new text, the garbage is pulled back in.

How can I either edit this text, or stop Acrobat from picking it up in the first place?

My other option seems to be to sanitize the heck out of the document, just have each page be "one" figure, and put the entire page's text in there instead of using the little boxes that Acrobat so helpfully puts in there.

posted by tilde to Writing & Language (13 answers total) 2 users marked this as a favorite

I am not sure how you'd push the text back into PDF in a manual way...you'd want something which pushed out an editable PDF at the end. (Or, you might be able to do all this in a Word environment, then re-save as PDF.)

What OCR tool are you using?

This free thing I found allows output of OCR in PDF. You can download a free trial version for Windows.
posted by Riverine at 8:39 AM on May 11, 2016

Everything was given to me as a PDF. Scanned images of sales flyers, for example.

When I go to use Acrobat X's "read aloud" feature on an untouched document, I get a message like "the page is blank".

So I go to Tools > Recognize Text > In this file

and I get lovely reading orders and groups, but crap OCR results. I want to edit whatever it is that has that crap OCR result and put in what I know is good text.

from

Why, you alloud newer refUse c!ooart.

to

Why you should never refuse dessert.

I don't care how bad or good the image looks. I care about what the screen reader in Acrobat and other software says aloud.
posted by tilde at 8:50 AM on May 11, 2016
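
One way to put a number on how far off the OCR is: a minimal sketch using Python's standard-library difflib to compare the OCR output with the hand-corrected text. The helper name ocr_similarity, and the idea of using the score to judge whether a page is worth patching or should just be retyped, are assumptions for illustration; neither comes from the thread or from Acrobat.

```python
from difflib import SequenceMatcher


def ocr_similarity(ocr_text: str, corrected_text: str) -> float:
    """Return a 0.0-1.0 similarity score between OCR output and known-good text.

    Hypothetical helper: 1.0 means the strings are identical,
    lower values mean more of the characters differ.
    """
    return SequenceMatcher(None, ocr_text, corrected_text).ratio()


if __name__ == "__main__":
    # The two sample strings quoted in the comment above.
    ocr = "Why, you alloud newer refUse c!ooart."
    good = "Why you should never refuse dessert."
    print(f"similarity: {ocr_similarity(ocr, good):.2f}")
```

A low score across a whole page would suggest retyping the page rather than correcting it suspect by suspect.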

Have you used Acrobat's "Find first OCR suspect"? It should let you make edits without changing the reading order.

However, in cases where I was in your shoes, I had to pretty much rebuild everything from scratch. Your one-figure plan would be the easiest way to achieve that. Creating accessibility in PDFs at the post-print, post-scan, post-OCR stage is extremely labor-intensive (and you are the labor).
posted by BrunoLatourFanclub at 9:46 AM on May 11, 2016

When I take a doc fresh from the archive, I can use Tools > Recognize Text > OCR suspects and pick either Find First Suspect or Find All Suspects and it finds nothing.

When I read aloud this fresh PDF, it says "Warning, Empty Page". The error I get before that, printed on the screen, is ...

Scanned Page Alert

This page contains only an image of a scanned page. There are no text characters. Would you like to run character analysis to try to make the text on this page accessible?

And then my options are "Do not show again" (toggle), OK, and Cancel.

If I hit "Cancel", I have nothing in the "Order" tab (except the individual pages), and I then can't use TouchUp Reading Order to capture regions and set up spots people can click to see what that little thing is.

I tried adding a Tag but that doesn't look like the right way to do things either (the tag text was not read out loud, I got the same empty page warning).
posted by tilde at 10:14 AM on May 11, 2016

"fresh" document Tools > Accessibility > Add Tags to Document

does the plain thing - One page = 1 image so I can't call out hotspots or break up the image at all into a reading path.

But I get no crap, as some consolation. Just the fact that it's an image, with a width and height. And I can add mondo alt text.
posted by tilde at 10:34 AM on May 11, 2016

Have you investigated the possibility of converting the files into Word docs?
posted by oceano at 1:47 PM on May 11, 2016

Hm. Maybe, just to be able to put the damn text in there as WORD alt text. Trying ...
posted by tilde at 1:55 PM on May 11, 2016

Well, you have a whole bunch of questions going on, over here.

If you get a fresh scan from your archive of unprocessed documents, and you want to use Acrobat's OCR tools to process 'em, then first you use Tools -> Recognize Text -> In this file. Only then can you use Find Next Suspect or Find All Suspects. Then the "Find Element" dialog tells you to revise the OCR'ed text directly, to "click on the highlighted object in the document and type in the new text."

If you are turning scans into readable text for accessibility purposes, you do want to tag it up after having OCRed, corrected, touched up, and otherwise prepped for screen reading. A fresh document consists of a single element - that scan - in a PDF wrapper, so of course when you tag a fresh document you can't break it up.

Are these documents shareable? Your other solutions lean in the direction of brute force (Word alt text?), so unless you want to actually be scanning and rebuilding the structures of these docs in Word - a labor-intensive task - then there should be some way to get the tools in Acrobat to do this for you.
posted by BrunoLatourFanclub at 1:56 PM on May 11, 2016

Converting to Word just gave me funky town text.

BrunoLatourFanclub, I hear you. I don't know how to use Acrobat Pro X much, and I don't know 508 much (not going for 508, just readable). And I'm trying to figure out how to do it repeatably, by me or by a helper if I get one.

What I have are PDFs created some ancient time ago. These PDFs are low-res (100dpi) scans of printed sales flyers. I no longer have access to the sales flyers. All I have are these "fresh" PDFs. Fresh as in I haven't run OCR on them.

The goal is to provide these crappy-looking PDFs to users who do and do not need to use screen readers. For those with screen readers, the text should be 'readable' by their screen readers, while preserving the original look of the scans.

I can't get new scans. I have to work with what I've got. I've downloaded a third party program that claims it can fix things but we will see if that actually works.

Okay, so I used Acrobat Pro X to convert the document into a Word doc. That "loses" the image and converts it into text that looks like the images but is not the images. So I miss on "preserving the look of the scans".

I've converted them into individual images, and I'll drop them into Word, slug in my ALT TEXT, and convert to PDF and cross my fingers.
posted by tilde at 2:25 PM on May 11, 2016

Well, putting it through Word and tagging it with ALT TEXT is the same thing as killing all of the document structure from a "reading out loud" standpoint. I'll try the third party editor tomorrow, and do the PDF > JPEG > INSERT JPEG TO WORD > ATTACH ALT TEXT IN WORD > FILE | SAVE AS | PDF.

Will try the third party editor tomorrow to see if we can preserve OCRd hotspots and reading order instead of one page = one glob of text of doom.
posted by tilde at 2:34 PM on May 11, 2016

100 dpi, unless these are really crisp 8-bit greyscale, will not return OCR worth a damn. It could well be quicker, cheaper and more accurate to have all of these keyed manually.
posted by scruss at 2:57 PM on May 11, 2016 [1 favorite]
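
To make the resolution point concrete, here is a rough arithmetic sketch of how many pixels a line of type gets at different scan resolutions. It relies only on the fact that one typographic point is 1/72 inch. The 10 pt font size and the helper name em_height_px are assumptions for illustration; the thread does not say what sizes the flyers used, and 300 dpi appears only as a comparison point.

```python
POINTS_PER_INCH = 72  # 1 typographic point = 1/72 inch


def em_height_px(font_size_pt: float, scan_dpi: float) -> float:
    """Pixels available for a font's full em height at a given scan resolution."""
    return font_size_pt / POINTS_PER_INCH * scan_dpi


if __name__ == "__main__":
    # 10 pt body text is an assumed size, not something stated in the thread.
    for dpi in (100, 150, 300):
        print(f"{dpi} dpi: {em_height_px(10, dpi):.1f} px per em")
```

The em height covers ascenders and descenders together, so the body of a lowercase letter gets noticeably fewer pixels than this figure.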

Yeah, the dangers of inherited works. :) I'm lobbying for at least 150dpi going forward. Today I'll try the third party app, and an idea that came to me overnight ... layer invisible text boxes to mimic hotspots on the image in the Word doc ...

/starts looking for her Rube Goldberg hat ...
posted by tilde at 6:01 AM on May 12, 2016

an idea that came to me overnight ... layer invisible text boxes to mimic hotspots on the image in the Word doc ...

NO.
posted by tilde at 6:48 AM on May 12, 2016

Ask MetaFilter
