Best digitizing options for dual-language documents?
December 20, 2009 8:11 PM

Options besides PDF for digitizing dual-language books?

I have some teaching guides I use in my work which are quite bulky and I would like to digitize them. They are fairly simple-looking documents except that the text is mixed between English and French (e.g. it has prompts in English telling you what to say in French to the kids when you are teaching, such as "say 'je suis ici' while pointing at yourself").

I am using a MacBook, and the scanner I got was a basic model, a Canonscan 100 scanner on sale. I tried to scan a few pages, and it was a bit of a mess. I got an okay picture when I scanned it as an image, but all attempts at OCR extraction were dismal. When I set the scanner to OCR mode and the language was English, I got gibberish. When I set it to French, things improved a little and it got much of it, but the text still needed a lot of cleaning up.

I thought maybe it was just that the software which came with the scanner was not that great. So I downloaded a few utilities which claim to extract text from PDFs. They had great reviews. They totally choked on the French parts.

The PDF looks fine (I made a two-page sampler for testing purposes), but displays a bit too small for easy reading on my Sony Reader. I uploaded it as a PDF, LRF and epub in separate files onto my Reader for testing. The epub could not zoom at all (i.e. the page stayed looking the same no matter what). The LRF looked just like the PDF on lowest zoom but when I tried to zoom in, the text got garbled as it had when I tried to extract it from the PDF.

So, there are three possibilities here:

1) The scanner is not that great
2) The scanner is fine and I just need better software
3) Dual-language files are too hard and I am stuck with PDF

What do you think? Is there anything I can do here, or will I go to all this work just to wind up with itty bitty text in a PDF file? If so, it may not be worth scanning them all...

Good options for digitizing documents like these? The books weigh several pounds each, and there are 7 in the set, so it's a ton of weight to carry around with me...
posted by JoannaC to Technology (6 answers total)
I'm a little confused. Are your PDFs just images, or are they OCR'd text, or both? If the PDF contains and displays text, you should be able to control the size of that text in the software that creates the PDF.

I suspect that the scanner is fine. You may need to buy better OCR software.

You might also have better luck displaying these PDFs on a Kindle DX. I have that and a Sony Reader, and the Kindle DX is much better for displaying PDFs.
posted by me & my monkey at 8:35 PM on December 20, 2009

Response by poster: The original document is a spiral-bound book. The software which came with the scanner can scan it as a picture (jpg or pdf) or OCR it in which case the 'text' shows up as a plain text window which pops up and you are supposed to cut and paste it into your doc.
posted by JoannaC at 8:38 PM on December 20, 2009

Are you able to make a few pages of your raw, non-OCR'd scanned images available to play with?
posted by flabdablet at 12:06 AM on December 21, 2009

Response by poster: Sure, flabdablet. Here.
posted by JoannaC at 12:19 AM on December 21, 2009

I think ABBYY FineReader lets you specify page regions according to language, but it's tedious as hell.
posted by fake at 2:43 AM on December 21, 2009

If you have MS Office, hidden away in MS Office Document Imaging is an OCR function running an extremely capable OmniPage engine. Multilingual OCR is hard; there will be a lot of retyping unless you can define areas in different languages.
posted by scruss at 5:16 AM on December 21, 2009
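
For anyone who would rather script this than define language regions by hand, here is a minimal sketch of running OCR with two languages at once. It assumes the open-source Tesseract command-line tool, which nobody in this thread mentioned, with its English (eng) and French (fra) language data installed. The file names and both function names are made up for illustration, and the output will probably still need proofreading.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, languages=("eng", "fra")):
    """Return the argument list for a multi-language Tesseract run.

    Tesseract accepts several language codes joined with '+', so the
    English and French models are both consulted for the same page.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", str(image_path), str(output_base), "-l", "+".join(languages)]


def ocr_page(image_path, output_base, languages=("eng", "fra")):
    """Run Tesseract on one scanned page image and return the recognized text."""
    # Assumption: the tesseract binary is installed and on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, languages), check=True)
    # Tesseract writes its result to <output_base>.txt
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling ocr_page("page1.png", "page1") on a scanned page image (a hypothetical file name) would return the recognized text as a string. Passing both languages means you do not have to pick English or French for the whole page, which was the choice the bundled scanner software forced.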
