

But I don't *want* anyone to recognize it...
January 25, 2013 11:44 AM

How do I prevent OCR on a document (typically a PDF, but I could use another document format if necessary)? I know that when I scan a hard copy to PDF I can disable the OCR step, but Adobe can still run OCR on any PDF I scan in, whether or not OCR was disabled at scanning, and I need to stop that. I have work product I'd like to distribute electronically, but my boss wants to make sure it's not searchable and that it's as hard to copy as I can make it. I can use any software or process within reason.
posted by mrs. taters to Technology (27 answers total) 3 users marked this as a favorite
 
OmniPage Pro has an "image only" scan option. You can also load pdfs and have the OCR option turned off.
posted by Melismata at 11:48 AM on January 25, 2013


How do I prevent OCR on a document (typically a PDF but I could use another document format if necessary)?

You can't. It's an analog hole problem at best, compounded by the widespread support for the PDF format and any other sane format for distributing text.

The best you're going to do is create image-only PDFs, which can be large and/or difficult to read.
posted by Inspector.Gadget at 11:52 AM on January 25, 2013 [9 favorites]


Do people still need to be able to read it? Because if someone can see it, it can be OCR'd. You can't stop me from taking a screenshot or a photo of my screen and running that through Tesseract.
posted by SemiSophos at 11:53 AM on January 25, 2013 [8 favorites]


I hate to come in and not answer the question, but every time I hear something like this I have to say it's a fool's errand. I'm not sure what your business reasons are for making it not searchable but if it's valuable enough, someone can just print it, scan it, and there you go. Oh, printing disabled? Decrypt it and remove the restrictions. You might even be able to get the OCR working directly that way. It's the same thing I say to people who want to prevent people from right clicking on their web pages. It's trivial to get around.
posted by rocketpup at 11:53 AM on January 25, 2013 [7 favorites]


Do you mean that you want to (1) prevent someone determined to OCR the document from OCRing it, or (2) disable the OCR features of Adobe PDF Reader(?) or other similar software? Or did I misunderstand completely?

Not sure about (2), but for (1) it's almost impossible with any sort of print font; you'd need to essentially rewrite it by hand (OCR of handwriting is very difficult).
posted by pyro979 at 11:54 AM on January 25, 2013


You can't. As you've noted, you can scan it as "image only" or "searchable" (or the equivalent on your scanner), but the whole point of OCR is that it can take your "image only" and convert it to a searchable, editable file.

The law firm I used to work at had this fully automated--just forward an email with any non-searchable PDF to a specified address, and it would be converted to a Word document in moments.
posted by Admiral Haddock at 11:54 AM on January 25, 2013 [3 favorites]


Encrypted PDFs have various options to disable printing, selecting, and so on. Nothing requires a PDF reader to obey those options, of course, but many do.

Otherwise, your best bet is to flatten each page into a giant bitmap, put the text over a cluttered background, and maybe distort it irregularly. You're trying to achieve the same goal as a CAPTCHA: to make something a human can read but a program can't. It's a difficult thing to do.
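
To illustrate the point about permission options being advisory, here is a minimal sketch. The bit values follow the permission table in the PDF specification as I understand it (bit 1 is the lowest-order bit); the two "reader" functions are hypothetical stand-ins, not any real viewer's code.

```python
from enum import IntFlag


class PdfPermission(IntFlag):
    """A few permission bits, numbered as in the PDF spec (bit 1 = lowest)."""
    PRINT = 1 << 2      # bit 3: print the document
    MODIFY = 1 << 3     # bit 4: modify contents
    COPY = 1 << 4       # bit 5: copy or extract text and graphics
    ANNOTATE = 1 << 5   # bit 6: add or modify annotations


def polite_reader_allows(flags: PdfPermission, action: PdfPermission) -> bool:
    """Hypothetical compliant reader: checks the flag before acting."""
    return bool(flags & action)


def rude_reader_allows(flags: PdfPermission, action: PdfPermission) -> bool:
    """Hypothetical non-compliant reader: never looks at the flags."""
    return True


# A document whose author granted printing only.
doc_flags = PdfPermission.PRINT
```

With `doc_flags` as above, the polite reader refuses `PdfPermission.COPY` and the rude one allows it. The flags are a request stored in the file, and enforcement is entirely up to the software opening it.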
posted by hattifattener at 11:55 AM on January 25, 2013 [1 favorite]


Do people still need to be able to read it? Because if someone can see it, it can be OCR'd. You can't stop me from taking a screenshot or a photo of my screen and running that through Tesseract.

Yes: as CAPTCHAs illustrate, modifying an image of text so that it is difficult to OCR also makes it frustratingly hard for a human to read.
posted by James Scott-Brown at 11:56 AM on January 25, 2013 [1 favorite]


If you're asking how to disable OCR on Acrobat, I don't know, but I suspect you can strip the text layer by doing a print-to-PDF from your favourite PDF viewer (making sure to hit the "as bitmap" option if there is one).

If you're asking how to make a PDF impossible for anyone else to OCR -- as others have pointed out, you can't stop people making screengrabs and OCRing them, though it's an inconvenient process for a long document. I can envisage technical ways of making it hard for another program to OCR the PDF*, but I don't know of any off-the-shelf programs that do it. If it's important enough, you could contract a programmer to do this, but it's still just making it harder rather than making it impossible.

*If I had to implement this, I'd try splitting the image of the text into two complementary masked "checkerboard" images and stack them as layers to reconstruct the whole image on each page. That way there isn't a complete bitmap which a program can pull out, which I suspect would hobble most automated OCR programs. That's just a guess though...
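
A rough sketch of that checkerboard split, to show the mechanics only. It assumes the page is an 8-bit grayscale bitmap held as a list of rows of ints (0 = black, 255 = white); the function names and the `tile` size are made up for illustration, and getting the two layers into an actual PDF is not covered.

```python
WHITE = 255  # blank background value for 8-bit grayscale


def checkerboard_split(image, tile=4):
    """Split a grayscale image into two complementary masked layers.

    Each layer keeps the pixels that fall on one colour of a checkerboard
    of tile x tile squares and is WHITE everywhere else.
    """
    layer_a, layer_b = [], []
    for y, row in enumerate(image):
        row_a, row_b = [], []
        for x, pixel in enumerate(row):
            on_a = ((x // tile) + (y // tile)) % 2 == 0
            row_a.append(pixel if on_a else WHITE)
            row_b.append(WHITE if on_a else pixel)
        layer_a.append(row_a)
        layer_b.append(row_b)
    return layer_a, layer_b


def stack(layer_a, layer_b):
    """Recombine the layers by keeping the darker pixel at each position."""
    return [
        [min(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(layer_a, layer_b)
    ]
```

Calling `stack(*checkerboard_split(image))` gives back the original bitmap, which is also the weakness: anything that renders the page, including a screenshot, sees the reassembled image.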
posted by pont at 12:14 PM on January 25, 2013


I would also recommend against accepting responsibility for ensuring that your document is never OCR'd or made searchable. Management may try to bully you into it if they've already promised it but -- as has been stated previously -- it's an impossible promise that can get you into trouble if someone finds it worthwhile to circumvent your protections.
posted by rocketpup at 12:29 PM on January 25, 2013 [5 favorites]


You can create the whole PDF as one big raster bitmap, but that won't stop people from using screenshot tools. Taking a screenshot is trivially easy. As the previous poster said, don't accept responsibility for creating something that's "copy proof". Anything can be copied.
posted by thewalrus at 12:33 PM on January 25, 2013


Cursive handwriting, scanned in and saved as a PNG image.
posted by mr_roboto at 12:37 PM on January 25, 2013 [1 favorite]


I've seen some impressive OCR'ing of cursive in the past year. If it's legible it's OCR'able.
posted by a box and a stick and a string and a bear at 12:39 PM on January 25, 2013


Nth-ing everyone else: if it can be read, it can be converted to a searchable format. Even if it involves having someone print out the PDF and type it back in - surprisingly affordable and quick. (Taskrabbit, Amazon's mechanical Turk, Craigslist...)

If you want to make it as hard as possible, write it out by hand and scan it back in. But that just makes it harder, not impossible.
posted by RedOrGreen at 12:43 PM on January 25, 2013 [1 favorite]


If we're talking like, valuable technical documents and not like, Dungeons & Dragons manuals, it's pointless. If you're selling a training course for $500, you can hire someone on odesk for $5/hr to re-type it.
posted by Oktober at 12:43 PM on January 25, 2013


If you use a regular pattern of dots (like this http://goo.gl/qI9Io) as a background, that will screw up almost every OCR engine ever written -- most (all?) OCR relies on edge detection and if your type's edges are goofy enough the OCR will come out... interesting to say the least.
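
For anyone who wants to experiment, a minimal sketch of laying a regular dot grid over a page bitmap. It assumes an 8-bit grayscale image stored as a list of rows of ints (0 = black, 255 = white); `overlay_dots` and its parameters are hypothetical names, and whether the result actually trips up a given OCR engine is untested here.

```python
def overlay_dots(image, spacing=3, dot=0):
    """Return a copy of a grayscale image with a regular dot grid drawn over it.

    Every `spacing`-th pixel in both directions is set to `dot`;
    the input image is left unmodified.
    """
    out = [row[:] for row in image]
    for y in range(0, len(out), spacing):
        for x in range(0, len(out[y]), spacing):
            out[y][x] = dot
    return out
```

A denser grid (smaller `spacing`) interferes more with letter edges but also makes the page harder for people to read, so the value needs tuning by eye.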

There are also some tools out there that will further prevent copy/paste, etc. of your document - Lizard Security PDF Security

I have not used that tool, but it should work as a starting point to search for particular features.
posted by drfu at 12:45 PM on January 25, 2013


Easy! Do you have a Word version or the original version of the file? If so, use the Windows Snipping Tool on the portion you do not want OCR'd. Remove that text from the original document and replace it with the newly snipped portion. It is now an image file, which cannot be OCR'd.

I just tested this out and Adobe OCR'd everything but the snipped portion.

Alternatively if you do not have access to the original document, can you redact (if you have Pro) and do the same steps above?

But yes, handwriting. I would be interested to see the OCR software that can read cursive.
posted by wocka wocka wocka at 2:17 PM on January 25, 2013


It is now an image file, which cannot be OCR'd.
I do not think OCR means what you think it means.
posted by b1tr0t at 2:41 PM on January 25, 2013 [11 favorites]


Some organisations who make things like e-textbooks go down the alternative route suggested by drfu above. They use systems like VitalSource to
1. Provide a screen reader for the material
2. Require users to register before they can see the material
3. Clearly warn that the material cannot be copied.
4. Watermark any material that is printed out with the name of the person who printed it and a copyright message.

This system can also be circumvented by manual transcription - or by re-processing the PDF - but it does combine a reasonable practical and moral deterrent with a readily readable and electronically indexed document.
posted by rongorongo at 2:52 PM on January 25, 2013


It is now an image file, which cannot be OCR'd.

In fact, that's the entire function of OCR. It wasn't invented to turn digitized text into digitized text.
posted by LonnieK at 3:16 PM on January 25, 2013


Ok - well you can say what you will, but you can try it and see that it works. I do actually know what OCR means.
posted by wocka wocka wocka at 6:09 PM on January 25, 2013


Easy! Do you have a word version or the original version of the file?

OCR digitizes non-digital text. If you have a Word version of the file, why would you need to digitize it? It's already digitized.
posted by LonnieK at 6:55 PM on January 25, 2013


This won't end well. If your boss wants you to go to great lengths to make this un-searchable, then I have to think that it is valuable enough that people will go to great lengths to convert it to searchable form (like printing it, rescanning it, and posting pages to MTurk for transcription). Similarly, if you are willing to make it hard for users to read the document in the service of making it impossible for them to search the document, then I have to think the contents are high-value, and worthwhile for someone to make the contents more accessible.

I'd suggest not going beyond the simple. You might distribute it as an image-only PDF and annotate each page with the words "Prepared for the exclusive use of Recipient Name" (which could be accomplished using mail-merge functionality).
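
If mail merge isn't handy, the per-recipient text is easy to generate yourself. A small sketch, assuming you already have a list of recipient names and a page count; `page_notices` is a hypothetical helper, the "(page n of N)" suffix is my own addition, and stamping the lines onto the pages is left to whatever PDF tool you use.

```python
from string import Template

NOTICE = Template("Prepared for the exclusive use of $name")


def page_notices(recipients, pages):
    """Map each recipient to a list of notice lines, one per page."""
    return {
        name: [
            f"{NOTICE.substitute(name=name)} (page {n} of {pages})"
            for n in range(1, pages + 1)
        ]
        for name in recipients
    }
```

Each recipient then gets their own copy, so a leaked page points back to whoever it was prepared for.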
posted by Good Brain at 7:48 PM on January 25, 2013


Is your boss concerned that the people with whom you intend to share this document will mis-use it? In that case, any means of distribution, digital or analog, puts the document at risk. The only way to keep it safe is to keep it out of untrusted users' hands. You can neither control nor be responsible for the actions of users you cannot trust.

OTOH, if the boss believes that digital distribution inherently puts the document at greater risk of being co-opted by someone you DON'T intend to share it with, you may have some options. For example, you could encrypt your distributed document and share the key only with trusted users. Now, you just need to know who to trust.

The risk of mis-use, unfortunately, cannot be controlled by technology -- only by people.
posted by peakcomm at 8:22 PM on January 25, 2013


wocka wocka wocka: Ok - well you can say what you will, but you can try it and see that it works. I do actually know what OCR means.

Right, so if you use Acrobat to open a pdf containing a mixture of paragraphs of actual text and images of paragraphs, attempts to OCR it with "Document->OCR text recognition->Recognise Text Using OCR" will fail with the error message "Acrobat could not perform recognition (OCR) on this page because: This page contains renderable text".

A trivial solution to this is to first get rid of the renderable text. For example, you can export the original PDF to multiple images using Acrobat, use a program like Preview to print the images into a single image-only PDF (using a PDF-generating 'printer driver'), then OCR that with Acrobat.
posted by James Scott-Brown at 6:47 AM on January 26, 2013


Ooops, sorry wocka, I somehow misread your comment beginning with 'Easy' as offering a way to OCR the PDF. Of course you were positing a way to protect it. So all pls disregard my response to that.

However, I'll repeat, as many have said -- whatever PDF the OP creates, by whatever means, anyone can print it out, one way or another. Then, if the hard copy can be read by a human, it can likely be OCR'd and searched.
posted by LonnieK at 1:43 PM on January 26, 2013


Thanks everyone! I will move forward with checking out PDF security; I'd like to make it harder than handing them a print copy that they just go and scan in, which is what happens now.
posted by mrs. taters at 10:00 AM on January 28, 2013

