Sort, OCR, export text from PDFs
March 17, 2008 11:13 AM
I need a server-side Linux or Unix-based software solution that will sort uploaded PDF files into three types: PDF-native (created in such a way that the text in the PDF is freely copyable), PDFs with embedded text over images (usually the result of a previous OCR job), and PDF-scanned (containing no text, only scanned images). It should extract the text from the PDF-native files and the PDFs with embedded text, and it should OCR the PDF-scanned files and export that text.
This means it should not be Windows-based, it should not run on the client or desktop side, and it should be scriptable.
Tesseract also looks like it will do a good job, though it appears to be a bit more fiddly. A 'solution' which inspects the PDF files, decides which of the three types they are, then extracts the text from them, shouldn't take more than an hour or two of work by an experienced coder.
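For what it's worth, here is a rough sketch of the "inspect and decide" step in Python. It assumes the poppler command-line tools `pdftotext` and `pdfimages` are installed and that your `pdfimages` supports `-list`. The function names, the 20-character threshold, and the "at least one image per page means text over images" rule are my own guesses, not anything standard, so tune them against your real files.

```python
import subprocess


def count_listed_images(pdfimages_list_output: str) -> int:
    """Count image rows in `pdfimages -list` output.

    Assumes the listing starts with two header lines (column names and a
    dashed rule), which is how current poppler builds print it.
    """
    rows = [ln for ln in pdfimages_list_output.splitlines() if ln.strip()]
    return max(0, len(rows) - 2)


def classify(text: str, image_count: int, min_chars: int = 20) -> str:
    """Guess the PDF type from extracted text and image count (heuristic)."""
    # pdftotext ends each page with a form feed, so count those as pages.
    pages = max(1, text.count("\f"))
    # Drop all whitespace (form feeds included) before measuring the text.
    visible = "".join(text.split())
    if len(visible) < min_chars:
        return "scanned"          # no usable text layer: needs OCR
    if image_count >= pages:
        return "text-over-image"  # text layer plus roughly one image per page
    return "native"


def classify_pdf(path: str) -> str:
    """Run the poppler tools on one file and classify it."""
    text = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    ).stdout
    listing = subprocess.run(
        ["pdfimages", "-list", path],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    ).stdout
    return classify(text, count_listed_images(listing))
```

Calling `classify_pdf("upload.pdf")` (a made-up filename) returns "native", "text-over-image", or "scanned". Since the first two types both get plain text extraction anyway, the only decision that really matters is whether the text layer is empty.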
One consideration should be the accuracy that you expect from the OCR of the PDF-scanned files, and how OCR-friendly these files are.
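Rendering resolution is one of the knobs you control here. A rough sketch of the OCR step, assuming poppler's `pdftoppm` and a `tesseract` build that accepts `stdout` as the output base; the function names and the 300 DPI default are my assumptions, so try a few values on your own scans:

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call that renders each page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="eng"):
    """Build the tesseract call that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Render every page of a scanned PDF and OCR the pages in order."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(render_command(pdf_path, Path(tmp) / "page", dpi), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        texts = [
            subprocess.run(
                ocr_command(page, lang),
                capture_output=True, encoding="utf-8", errors="replace", check=True,
            ).stdout
            for page in pages
        ]
    # Join pages with form feeds, mirroring how pdftotext separates pages.
    return "\f".join(texts)
```

Running the same file at two resolutions and comparing the output by eye is a cheap way to find out how OCR-friendly your scans are before committing to a tool.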
posted by onalark at 11:27 AM on March 17, 2008
gocr's output is entirely dismal. I haven't tried tesseract since it became possible to actually build it on anything.
Vividata might have a solution for you, but it will be expensive. Nuance claims to have a Linux SDK for OmniPage 15. ABBYY may even have a server-based FineReader engine.
posted by scruss at 12:31 PM on March 17, 2008
posted by onalark at 11:22 AM on March 17, 2008