PDF text recognition
April 18, 2012 10:55 AM

What is the quickest way to digitally black out the social security numbers in 100s of PDFs?

I believe there is a way to get the text to be recognizable, and then I could do a quick search for the words that occur before the relevant numbers (which will be different every time). This could get my cursor to the appropriate place quickly, but then how can I apply the black box over and over without having to format it every time? I'm in a crunch and prefer speed over simplicity.
posted by JJkiss to Computers & Internet (9 answers total)
Is it on a form where the numbers are in the same place on every page? If so, then you can create a formatting template in Acrobat with a black bar over the appropriate location.
posted by Jon_Evil at 11:03 AM on April 18, 2012

Choosing speed over simplicity might leave the numbers viewable by copy-pasting or by quick scrolling. (There was a case a few years back where the police released redacted text messages, but by flicking and freezing through the pages we were able to view the redacted information.)

If you can highlight and copy the text, you might not be able to black it out in an effective way.

Without more information about how it is formatted and what tools you have at your disposal ...
posted by tilde at 11:06 AM on April 18, 2012

Off-the-wall idea: print, use a marker, scan back to PDFs.
posted by OnTheLastCastle at 11:07 AM on April 18, 2012 [4 favorites]

Yeah, seconding tilde here. Anything that preserves the SSN in the document and depends on Acrobat security to keep the document redacted isn't going to fly. There are lots and lots of programs out there that just don't respect Acrobat security, not to mention that if the text is stored as text, and not as an image, you could script something up pretty quickly to batch-harvest the numbers.
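
To give a sense of how little scripting that takes, here is a rough sketch, assuming the text layer has already been pulled out of each PDF by some extraction tool. The function name harvest_ssns and the sample text are made up for illustration, and the pattern only catches numbers written as 123-45-6789.

```python
import re

# Matches the common dashed SSN layout: 3 digits, 2 digits, 4 digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def harvest_ssns(text):
    """Return every SSN-shaped string found in already-extracted text."""
    return SSN_PATTERN.findall(text)


if __name__ == "__main__":
    # Hypothetical text layer from a "redacted" page: the black box is
    # only drawn on top, so the digits are still in the extracted text.
    sample = "Name: A. Example\nSSN: 123-45-6789\nPhone: 555-1234"
    print(harvest_ssns(sample))
```

If a loop like this finds anything in a file that is supposed to be redacted, the redaction is cosmetic only.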
posted by Oktober at 11:11 AM on April 18, 2012

Use the Acrobat formatting template to put a black bar at the appropriate spot on each document. Export each as a TIFF or PNG and then convert each of those to a PDF.
posted by Picklegnome at 11:14 AM on April 18, 2012 [5 favorites]

Is it on a form where the numbers are in the same place on every page? If so, then you can create a formatting template in Acrobat with a black bar over the appropriate location.

If you do this, or any kind of "mark out the number with a black box" strategy, there is a very good chance that the SSNs are still lurking in the file somewhere, easy to find for someone technically proficient even if you can't see them on the screen. If you plan on releasing these files, please take the extra step of rendering them to a bitmap format (like PNG or TIFF), where what-you-see-is-what-you-get, and then back into PDF. Yes, this will cause the PDFs to print badly, since you're replacing all of the nice vector-based drawing commands with a fixed-resolution image.
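
A minimal sketch of that round trip, assuming the third-party PyMuPDF library (imported as fitz) is installed and that the black boxes are already drawn on the pages. The function names, file paths, and the 200 dpi default are my own choices for illustration, not anything Acrobat does.

```python
def pixel_size(width_pt, height_pt, dpi):
    """PDF page sizes are in points (72 per inch); return the size in pixels."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def rasterize_pdf(src_path, dst_path, dpi=200):
    """Render every page to a bitmap and rebuild a PDF from those bitmaps.

    The output holds only images, so any text hidden under a black box
    in the source is not carried over.
    """
    import fitz  # PyMuPDF, assumed installed; imported lazily on purpose

    src = fitz.open(src_path)
    out = fitz.open()  # new, empty PDF
    for page in src:
        pix = page.get_pixmap(dpi=dpi)  # fixed-resolution render of the page
        new_page = out.new_page(width=page.rect.width, height=page.rect.height)
        new_page.insert_image(new_page.rect, pixmap=pix)
    out.save(dst_path)
    out.close()
    src.close()


if __name__ == "__main__":
    # A US Letter page (612 x 792 points) rendered at 300 dpi.
    print(pixel_size(612, 792, 300))
```

The pixel_size helper shows the trade-off: a higher dpi prints better but makes bigger files, and a lower dpi is where the bad print quality comes from.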
posted by qxntpqbbbqxl at 11:17 AM on April 18, 2012 [1 favorite]

Yes, to PNG and back to PDF is something I will most definitely do, but just so nobody worries, these aren't being released to the public, only to executives at a TV network, lol. The SSN is not in the same place every time; there are 4 different versions of this doc I'm going through.
posted by JJkiss at 11:22 AM on April 18, 2012

Thanks for the update, JJkiss.

Pickle, qxntpqbbbqxl - thanks for something new :) I luckily never had to do such redacting, good to know there is an easy right way to do it.
posted by tilde at 11:25 AM on April 18, 2012

What version of Acrobat are you using? In the version I have, you use the text recognition tool (Tools > Recognize Text > In This File) and can then search and redact for patterns (like SSNs) as described here. I'm sure there's then some techy way to make sure the redaction is secure, but I'd probably just print it out and rescan to be sure.
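
For the curious, the pattern idea looks roughly like this outside of Acrobat. This is only a sketch on plain text that has already been recognized and extracted, not a way to edit the PDF itself; the label list, the mask, and the function name are assumptions for illustration.

```python
import re

# Hypothetical labels that might precede the number on the forms.
LABELS = ("Social Security Number", "Social Security No.", "SSN")


def redact_after_labels(text, labels=LABELS, mask="XXX-XX-XXXX"):
    """Mask SSN-shaped numbers that directly follow one of the labels.

    Returns (new_text, number_of_replacements).
    """
    label_alt = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(
        rf"(?P<prefix>(?:{label_alt})[\s:#]*)"
        rf"(?P<number>\d{{3}}[- ]?\d{{2}}[- ]?\d{{4}})\b",
        re.IGNORECASE,
    )
    # Keep the label and separator, swap only the digits for the mask.
    return pattern.subn(lambda m: m.group("prefix") + mask, text)


if __name__ == "__main__":
    sample = "Employee SSN: 123-45-6789, badge 555-12-3456"
    print(redact_after_labels(sample))
```

Anchoring on the label means a number in a different spot on each version of the form still gets caught, while similar-looking numbers without a label are left alone.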
posted by wuzandfuzz at 12:17 PM on April 18, 2012 [2 favorites]
