Performing OCR on a Fraktur/Blackletter PDF and getting a searchable image PDF
April 28, 2012 3:17 PM

I have Adobe Acrobat X Pro on Windows 7. Is there any free or inexpensive way to use OCR to create searchable image PDFs from image-only PDFs of texts written in German in Fraktur/Blackletter script?

I have downloaded the Tesseract OCR engine from Google; and while I have found a couple of GUIs for it, they all just produce plain text dumps. I want a program that will create a PDF with the original image as a layer as well as searchable text, which I can highlight/underline/comment as with a PDF created from applying OCR to something written in standard Latin script.

PS: I am aware of the ABBYY Historic OCR software, which, besides being incredibly expensive, also has a page limit.

posted by dhens to Computers & Internet (14 answers total) 1 user marked this as a favorite
Evernote will probably do this for you. You'll have to add them as notes, then can export them as Searchable PDFs after it processes them (on its server). Should be fine to use the free version, but there's a limit on how much you can upload.
posted by iamscott at 3:21 PM on April 28, 2012

It used to be that if you uploaded a document to the (free) Any2DjVu service you could tick a box that would cause it to be (rather poorly) OCRed and a text layer inserted into the .djvu file, which if I recall correctly would survive the .djvu being converted to a PDF.

I think the DjVuLibre command line tools that the service is based on allow you to add and remove text layers from .djvu files, which might let you stick in the Tesseract OCR text if that comes out better and convert to PDF.
posted by XMLicious at 3:29 PM on April 28, 2012

I want a program that will create a PDF with the original image as a layer as well as searchable text, which I can highlight/underline/comment as with a PDF created from applying OCR to something written in standard Latin script.

PDF-XChange Viewer will do this for free, even without paying for the pro version.
posted by Inspector.Gadget at 3:48 PM on April 28, 2012

get the drift. had some extra words in there.
posted by Inspector.Gadget at 3:49 PM on April 28, 2012

Response by poster: To reiterate, I am looking for something that can create a searchable PDF from BLACKLETTER script ("Gebrochene Schrift" in German, what is commonly called Gothic or Fraktur in English, see here for examples), NOT LATIN SCRIPT. All of the solutions posted so far just seem to work for regular Latin-based scripts (which I can already do with Adobe Acrobat X Pro).

I may have to look further at the DjVuLibre tools which XMLicious posted. I will still welcome further suggestions!
posted by dhens at 4:17 PM on April 28, 2012

Assuming you can train tesseract/Cuneiform (or find training files) for Fraktur, both packages output hOCR which you can combine using hocr2pdf to create a searchable pdf, as described here: How to extract text with OCR from a PDF on Linux?

It may be a bit of a pain installing these tools on a non-Unix environment. I've been using this workflow to fix my bank's amazingly bad PDF statements.
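
For anyone who wants to script the workflow above, here is a minimal Python sketch of the two steps for a single page image. It assumes that `tesseract` and `hocr2pdf` (from ExactImage) are on the PATH, that the Fraktur data is installed under the language code `deu-frak`, and that the page has already been saved out of the PDF as an image. The helper names and file names are hypothetical. Depending on the Tesseract version, the hOCR file may end in `.html` or `.hocr`, so check what it actually wrote.

```python
import subprocess


def tesseract_hocr_cmd(image, out_base, lang="deu-frak"):
    """Build the command asking Tesseract for hOCR output instead of plain text."""
    # "hocr" is the name of a Tesseract config file that switches on hOCR output.
    return ["tesseract", image, out_base, "-l", lang, "hocr"]


def hocr2pdf_cmd(image, pdf_out):
    """Build the hocr2pdf command; the tool reads the hOCR from standard input."""
    return ["hocr2pdf", "-i", image, "-o", pdf_out]


def ocr_page(image, out_base, hocr_ext=".html"):
    """Run both steps for one page image (needs both tools installed)."""
    subprocess.run(tesseract_hocr_cmd(image, out_base), check=True)
    # Assumption: Tesseract wrote out_base + hocr_ext; adjust the extension if not.
    with open(out_base + hocr_ext, "rb") as hocr:
        subprocess.run(hocr2pdf_cmd(image, out_base + ".pdf"), stdin=hocr, check=True)
```

Calling `ocr_page("page-001.tif", "page-001")` would, under those assumptions, leave a searchable `page-001.pdf` next to the image. A multipage PDF would first have to be split into page images and the resulting PDFs joined afterwards, which this sketch does not cover.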
posted by scruss at 4:23 PM on April 28, 2012 [1 favorite]

Response by poster: Thanks scruss, that looks interesting. Tesseract works fairly well on Fraktur; when I installed it from Google Code I included the files for Fraktur. When I use gImageReader (a GUI tool that uses Tesseract to create plaintext dumps) on PDFs of images with text in Fraktur, it works well.

What I am looking for then it seems is:
1.) A GUI for outputting hOCR with Tesseract from a multipage PDF.
2.) A GUI for integrating the hOCR with the PDF.
posted by dhens at 5:19 PM on April 28, 2012

Not really a gui guy, sorry.

Another (command-line, sorry) possibility is PDFBeads.
posted by scruss at 5:29 PM on April 28, 2012

Response by poster: I am going to look at this tutorial on using PDFBeads under Windows tomorrow. As always, if anyone has any more user-friendly ideas, I'd love to know about them!
posted by dhens at 8:32 PM on April 28, 2012

Response by poster: I tried installing PDFBeads as shown above; I got the program to install but
1. I tried to install "hpricot" and that apparently is not working and
2. I don't know the commands for actually using PDFBeads...

Again, does anyone have a more user-friendly solution?
posted by dhens at 8:12 AM on April 29, 2012

Response by poster: Update: It looks like Tesseract + hOcr2PDF might work, but right now I am having some problems. I have outlined the problems here; I am waiting to hear back from someone on that forum.
posted by dhens at 6:06 PM on April 29, 2012

Response by poster: Of course, if someone here can help me with the problem I have outlined in the link in my previous post, I would be grateful for that, too!
posted by dhens at 12:29 AM on April 30, 2012

Response by poster: Update (again): The developer of the software has contacted me and he said he is working on the issue. Thanks everyone!
posted by dhens at 10:11 PM on May 1, 2012

Best answer: SOLVED: The developer of hOcr2PDF.NET was kind enough to work with me on this.

Make sure you have Microsoft.NET installed.

The first thing you need to do is download and install the "Tesseract" OCR engine. This is the software which actually does the text recognition. It doesn't have a graphical interface, nor can it put the text back into the PDF as a separate layer, so we will be installing software that does that in a second.

Download the installation file for Tesseract-OCR here.
When you install it, be sure that "Add to PATH" is selected (it should be by default).
You MUST also check "Download and install German (Fraktur) language" under "Language data", which is NOT selected by default.
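
If you want to double-check that the Fraktur data really got installed, something like the following Python sketch may help. It assumes a Tesseract build that supports the `--list-langs` option (older builds may not) and that the output is a header line followed by one language code per line. The helper names are made up for illustration.

```python
import shutil
import subprocess


def parse_langs(listing):
    """Extract language codes from `tesseract --list-langs` style output."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # "List of available languages (3):", so it is skipped.
    return lines[1:]


def has_language(code="deu-frak"):
    """Return True if Tesseract is on PATH and reports the given language."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False  # "Add to PATH" was probably not selected
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some versions print the list to stderr, so look at both streams.
    return code in parse_langs(result.stdout) or code in parse_langs(result.stderr)
```

If `has_language()` returns `False`, re-running the installer with the Fraktur option checked is probably the easiest fix.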

Once this is installed you can install the program which provides a GUI and actually puts the text back into the PDF as a layer. It is called "PDFCompressor" and the developer of the software worked with me on getting it to work.

Download the newest file here. Inside of that archive, find the "PDFCompressor.7z" archive. (You'll need 7Zip for this, obviously). Extract ALL THE CONTENTS to a (preferably new) folder somewhere where you want to keep it. The file you want to launch is "PDFCompressor.exe". You might want to make a shortcut for that.

Open the "Options" menu. The most important things are that you choose "Tesseract" as the OCR engine and that the "Language:" field says "deu-frak" (no quotation marks). Hit "OK."

You can add PDFs (they will queue up if you wish, so you can apply OCR to several at once) using the "Add Files" button.

Select where you want the PDFs to be saved to (I suggest somewhere different than where the original files are; otherwise you will get a file called "OriginalFileName_CV.pdf") by either typing in the location by hand or choosing it by pressing the ".." button.

It can be rather slow for multi-page PDFs but the results are pretty good!

Presumably this will work for other, somewhat obscure languages that Tesseract can recognize, for example Danish written in Fraktur script; just make sure to download that language data when you install Tesseract and to put the correct language code in the Options menu of PDFCompressor ("dan-frak" in that case).

PS: I don't know how well this program will work on any PDF which might already have had some OCR done to it or text annotations on it. Your mileage may vary.

PPS: Thanks to scruss for getting me on the right path by suggesting hOcr tools and for offering to look at the PDFs himself.
posted by dhens at 8:15 AM on May 6, 2012 [1 favorite]
