How to automate searchable PDF creation with automatic document feeder, PaperPort and OmniPage?
February 1, 2007 2:38 AM

How can I automate PDF creation with my automatic document feeder, ScanSoft PaperPort and OmniPage? I need to have PDFs that look like the original document and are searchable (not image only).

I was fed up with scanning single pages with my CanoScan scanner, so I bought a Brother MFC-5440CN since it offers an automatic document feeder (35 pages).

I can scan pages to PDF all right, but those PDFs are not searchable (you can't search for text within them). My CanoScan scanner automatically created searchable PDFs, but with this Brother MFC I seem to have to cobble a solution together myself...

I even updated the bundled PaperPort and OmniPage to the latest versions, PaperPort 11 and OmniPage 15 - but still can't get it to work.

I'd be very happy to get a detailed "how to" for what I need to do to get this working; search engines couldn't help me.
posted by ideaguy to Computers & Internet (9 answers total) 3 users marked this as a favorite
I am not familiar with OmniPage, so I cannot address that, but you can retrofit your image-only PDFs by running OCR on them after the fact with PDF Transformer by ABBYY.
posted by megatherium at 4:38 AM on February 1, 2007 [1 favorite]

OmniPage (at least the latest version) will OCR PDF files. I'm pretty sure that you can just use OmniPage to do the scan in the first place and it'll automagically OCR the file and make it searchable, but I haven't used it in a while, so I could be wrong.

My clients all gave up on OCR several years ago, so I'm not too familiar with the latest version beyond what I've read on ScanSoft's site.
posted by wierdo at 7:22 AM on February 1, 2007

There are a number of products that will do this. The leaders in the field of OCR right now are pretty much ABBYY and ScanSoft. I am willing, but contractually unable, to tell you which is the better product in my opinion. If you don't have them already, I'd try to get my hands on demo copies of their software and try them out.

I can tell you that my company makes .NET components that include scanner control, OCR and searchable PDF generation. These are not end user applications - they are components that can be assembled into an end user application. This is probably not what you want, although it could most assuredly be used to build what you need. If you're interested, I'll post company info.
posted by plinth at 8:31 AM on February 1, 2007

I have very similar needs, and I got an email from Nuance claiming that OmniPage 15 Professional adds the OCR ability to PaperPort, so double-check everything first. It may be just OmniPage Professional.

I wasn't sure of the workflow involved with OmniPage, and wasn't about to shell out money to find out, so I was looking at another tool called FileCenter (approx. $100), which scans to searchable PDF anyway. It can also convert TIFFs to searchable PDF. It has its own OCR engine, and can also use OmniPage's and MS Office 2003's. You can try it for 30 days.
posted by blue_wardrobe at 9:00 AM on February 1, 2007

I used ABBYY FineReader for this purpose about 5 years ago. Technology has probably improved, but I found that FineReader did an excellent job overall.
posted by nickerbocker at 9:05 AM on February 1, 2007

Response by poster: Thanks for all your suggestions, I'll experiment some more and will keep you posted here. PDF Transformer sounds like I should try it too.

By the way, is there a tool to reduce the size (KB) of a PDF file while keeping it intact? It should not remove things like the searchable text.
posted by ideaguy at 3:00 AM on February 2, 2007

PDF reduction is something that, honestly, is best done in the authoring process. For example, if you're dealing with 1-bit images with text-under, you really want to make sure that the generation software is using JBIG2 compression instead of, say, CCITT Group 3 or Group 4. This alone should give you the biggest gain right out of the gate, since your images will be your biggest chunk of data. You also want to decide whether or not the images should be resampled to a lower resolution. Both can be done in post-processing, but are really better done at the head end.
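
To put rough numbers on the resolution point, here is a small Python sketch of how much raw data a 1-bit scan holds before any compression is applied. The Letter page size (8.5 x 11 inches) and the resolutions are assumptions chosen for illustration, not figures from this thread, and the function name is made up for the example.

```python
import math


def raw_bitonal_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel scan.

    Each row is padded up to a whole number of bytes.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = math.ceil(width_px / 8)
    return bytes_per_row * height_px


if __name__ == "__main__":
    # Assumed page size: US Letter. Resolutions are illustrative only.
    for dpi in (300, 200, 150):
        size = raw_bitonal_bytes(8.5, 11, dpi)
        print(f"{dpi} dpi: {size / 1024:.0f} KB before compression")
```

Halving the resolution cuts the raw pixel data to about a quarter, which is why the resampling decision matters so much. The final file size still depends on the image compression used on top of this.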

As far as other things are concerned, anything else is a small incremental win. In my own PDF generation code, I do my best to make sure that as many internal streams as possible are filtered through an LZW-class compressor. For interchange, I usually leave ASCII85 filters on so the files are still technically text files.
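
As a rough illustration of that filter chain, here is a minimal Python sketch that compresses a stream and then ASCII85-encodes it. It is not the poster's code. Python's standard library has no LZW codec, so the sketch swaps in zlib's Deflate (the method behind PDF's FlateDecode filter) as a stand-in for the "LZW-class" compressor; the function names and the sample stream are made up for the example.

```python
import base64
import zlib


def encode_stream(data: bytes) -> bytes:
    """Deflate the data, then ASCII85-encode it so the result is printable ASCII.

    The trailing "~>" is the end-of-data marker used by ASCII85 in PDF.
    """
    compressed = zlib.compress(data, 9)
    return base64.a85encode(compressed) + b"~>"


def decode_stream(encoded: bytes) -> bytes:
    """Reverse encode_stream: strip the marker, undo ASCII85, then inflate."""
    if not encoded.endswith(b"~>"):
        raise ValueError("missing ASCII85 end-of-data marker")
    return zlib.decompress(base64.a85decode(encoded[:-2]))


if __name__ == "__main__":
    # Hypothetical, repetitive page content of the kind that compresses well.
    sample = b"BT /F1 12 Tf (searchable text) Tj ET\n" * 200
    encoded = encode_stream(sample)
    print(len(sample), "bytes raw ->", len(encoded), "bytes encoded")
    assert decode_stream(encoded) == sample
```

ASCII85 turns every 4 bytes into 5 characters, so it gives back roughly a quarter of what the compressor saved. That is the price of keeping the file text-only.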

Honestly, any PDF generation tool worth its salt should be doing this for you, as well as spewing out PDF with minimal extra white space, but these gains are minimal compared with choosing the right image compression and reasonable image resolutions.
posted by plinth at 7:49 AM on February 2, 2007

I believe that OmniPage has a watch folder. You can then set up one of their macro workflows to OCR -> PDF whenever image files from the scanner are sent to that watch folder. That's how I've done something similar in the past.
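
For readers unfamiliar with the watch-folder idea, here is a minimal Python sketch of the general pattern: poll a folder and hand each new image file to a conversion step. This is not OmniPage's mechanism or API. The `convert` callback is a placeholder for whatever OCR-to-PDF tool you use, and the folder path, file suffixes, and function names are all assumptions made for the example.

```python
import time
from pathlib import Path

# Assumed scanner output formats; adjust to whatever your scanner writes.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def new_scans(folder, seen):
    """Return image files in folder not handed out before, oldest first.

    Every returned path is added to `seen` so it is reported only once.
    """
    found = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES and p not in seen
    ]
    found.sort(key=lambda p: (p.stat().st_mtime, p.name))
    seen.update(found)
    return found


def watch(folder, convert, interval=5.0, max_polls=None):
    """Poll folder and call convert(path) once for each new scan.

    convert is a placeholder for the OCR-to-PDF step.
    max_polls=None means poll forever.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in new_scans(folder, seen):
            convert(path)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical folder name; print stands in for the real OCR step.
    watch("scans", convert=print, interval=5.0)
```

A real setup would also need to wait until the scanner has finished writing a file before converting it, which this sketch does not attempt.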
posted by i_am_a_Jedi at 7:54 AM on February 2, 2007

Response by poster: I tried PDF Transformer - but sadly I didn't like the resulting PDF files, the quality isn't that good. Maybe I did something wrong, but since there aren't hundreds of options I doubt that.

I'll play around with OmniPage some more! :)
posted by ideaguy at 10:28 AM on February 6, 2007
