What is the fastest way to transfer some pages from an old physical book to a searchable computer text file?
December 5, 2006 2:46 PM

What is the fastest way to transfer pages from an old physical book to a searchable computer text file?

What I have been doing: I use my cheap $50 scanner and Adobe Acrobat Professional to scan the book page by page, then save the file as a PDF and also as a TIFF. I then process the TIFF files with OmniPage, copy the OmniPage results to a .txt file, manually clean up the leftover garbage, and format the text as close to the original page as possible.
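
Part of the manual cleanup step described above can be scripted. Here is a minimal sketch in Python, assuming the OCR results have already been saved as plain text. The function name `clean_ocr_text` and the rules it applies are illustrative assumptions, not anything OmniPage or Acrobat provides. Note that it also removes the hyphen from genuinely hyphenated words that happen to break across lines, so the result still needs a proofread.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output: rejoin split words, drop junk lines, trim spaces."""
    # Rejoin words split by a hyphen at the end of a line ("docu-\nment").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    cleaned = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs, then trim both ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Assumption: a non-empty line with no letters or digits is scanner noise.
        if line and not re.search(r"[^\W_]", line):
            continue
        cleaned.append(line)
    # Collapse runs of blank lines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned)).strip() + "\n"
```

For example, `clean_ocr_text(open("page1.txt").read())` would return the tidied text of a hypothetical `page1.txt`.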

The most time-consuming part of the task for me is the initial scanning, page by page, manually on a scanner.

Is there a faster way of getting from A to Z?

Am I missing anything?

How does Google Books (books.google.com) do it when they digitize out-of-print books?

How do people in law firms do it with their massive amounts of paperwork from years past that need to be stored digitally and made searchable on a computer?

Would you do things differently if you had a bunch of loose-leaf paper compared to a hardbound book?

Please enlighten.

Thank you very much.
posted by cluelessguru to Computers & Internet (14 answers total)
Response by poster: I found the following posts but they do not really answer my questions, because I am into scanning books of many different sizes





posted by cluelessguru at 2:52 PM on December 5, 2006

You can hire a vendor to do it for you. They can scan the book and provide OCR (optical character recognition) or OWR (optical word recognition). You can probably get it done for around $0.06/page. Search for litigation service vendors or imaging vendors in your area.
posted by cwarmy at 2:57 PM on December 5, 2006

When I worked for an equity branch of a university, we needed to convert books from paper to electronic format so that students with vision impairment could make use of them. We would cut the books so that the spines were removed and all the pages were separate, then run them through an automatic sheet feeder on the scanner, which went directly to OCR (something like OmniPage). Then some poor schmuck (i.e. me) got the originals and the electronic copy and would wade through it manually, finding the errors.

Where possible, we would contact the publishers and ask if they had an electronic copy that we could make available to the student. Sometimes they would help, but not often.
posted by b33j at 2:59 PM on December 5, 2006

Response by poster: Thank you very much for all your great answers :)

I forgot to say I want to do this as cheaply and as quickly as possible.

$0.06 a page? This sounds like something I can live with.

Any recommended mail-order service providers who charge $0.06 or less a page for scanning books and cleaning up errors?

Does $0.06 a page include everything, i.e. scanning and then manually going through the pages to clean up the errors?

Or just scanning?

If just scanning, $0.06 a page is still pretty good.

Would you recommend any mail-order service providers who do this?
posted by cluelessguru at 3:12 PM on December 5, 2006

I realize you're not running Linux, but here's my experience anyway.

I've used OCRShop XR (30-day free trial, and I'm sure somewhere on the intertubes are ways around that) for Linux, which has a fairly easily automatable command-line version. I have a shell script that scans a page, passes it to OCRShop for processing, and then creates a PDF of the recognized text.

If you're doing a lot of this, it may be worthwhile to set up a VMware image of Linux or something. The command-line use makes automating things so much easier.
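
The scan-then-OCR automation described above might look roughly like the sketch below. This is not the script from this comment: it is written in Python rather than shell, and it swaps in the open-source Tesseract engine for OCRShop XR. It assumes the SANE `scanimage` tool and `tesseract` are installed. The function names and file layout are made up for illustration, and `--resolution` is a device-dependent option that most, but not all, scanner backends accept.

```python
import subprocess
from pathlib import Path


def scan_command(resolution_dpi: int = 300) -> list[str]:
    """Build a SANE `scanimage` command that writes a TIFF to stdout."""
    return ["scanimage", "--format=tiff", "--resolution", str(resolution_dpi)]


def ocr_command(image: Path, output_base: Path) -> list[str]:
    """Build a Tesseract command that writes <output_base>.pdf with a text layer."""
    return ["tesseract", str(image), str(output_base), "pdf"]


def scan_and_ocr(page_number: int, workdir: Path) -> Path:
    """Scan one page, OCR it, and return the path of the searchable PDF."""
    workdir.mkdir(parents=True, exist_ok=True)
    image = workdir / f"page{page_number:04d}.tiff"
    output_base = workdir / f"page{page_number:04d}"
    # scanimage writes the image data to stdout, so redirect it into the file.
    with image.open("wb") as out:
        subprocess.run(scan_command(), stdout=out, check=True)
    # Tesseract adds the ".pdf" extension to the output base itself.
    subprocess.run(ocr_command(image, output_base), check=True)
    return output_base.with_suffix(".pdf")
```

Calling `scan_and_ocr(1, Path("book"))` once per page, in a loop that waits for you to turn the page, would give one searchable PDF per page.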
posted by Skorgu at 3:16 PM on December 5, 2006

Response by poster: Yes, I am the schmuck who only knows how to use Windows XP :)

But it is great to know that other operating systems can do what Windows cannot.

Because I need all the help I can get.

And if I am desperate enough I will learn other operating systems.

Thank you very much.
posted by cluelessguru at 3:36 PM on December 5, 2006

>How does google books (books.google.com ) do it when they digitize out of print books?

Not that this will help you, but I recently read the book "The Google Story": they use highly specialised and expensive technology they spent a lot of time and money developing -- specifically so that they wouldn't have to take apart the books in any way (see b33j's method).
posted by AmbroseChapel at 3:56 PM on December 5, 2006

Acrobat Pro can do OCR, and it's very good at it. My friend cuts his textbooks, scans all the pages in (with a bulk scanner), and then OCRs them via Acrobat. We have an open-book test class this semester, and searching for random words through the textbook is 1000 times better than any index could be.
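
If the OCR results are also saved as plain text, the same kind of word search can be done outside Acrobat. Here is a minimal sketch in Python, assuming the text of each page has already been loaded into a dictionary keyed by page number. The name `find_pages` is hypothetical, and the match is a simple case-insensitive substring test, so searching for "tort" would also hit "distorted".

```python
def find_pages(pages: dict[int, str], word: str) -> list[int]:
    """Return the sorted page numbers whose text contains `word`, ignoring case."""
    needle = word.casefold()
    return sorted(n for n, text in pages.items() if needle in text.casefold())
```

For instance, `find_pages({1: "Contract law basics", 2: "Torts"}, "contract")` returns `[1]`.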
posted by cschneid at 4:58 PM on December 5, 2006

Response by poster: cschneid,

It is true that Acrobat Pro (I am using version 7) can do OCR.

But I find that OmniPage Pro 14 is still better overall.

But what OmniPage cannot give me, the OCR function of Acrobat Pro 7 sometimes can.

So I highly recommend using both of them at the same time, because they seem to complement each other in a lot of instances.

Thank you very much for all your wonderful answers so far. Keep them coming.

I appreciate it.
posted by cluelessguru at 6:32 PM on December 5, 2006

If the book is in the public domain (and interesting enough) you might consider sending it off to Distributed Proofreaders. (And don't underestimate 'interesting'; there are all kinds of people who proof for them, with all kinds of professions and hobbies.) Of course, if it's proprietary, you're out of luck there. How old is your 'old book'?

I confess I don't know how much time Distributed Proofreaders needs for overhead (cutting the spine, scanning the book, initial OCR, and post-processing), but their page-per-day rate is fearsome, and they release hundreds of e-texts a month, so I don't guess it's too bad. Might not be the fastest overall solution, but great public benefit.
posted by eritain at 8:41 PM on December 5, 2006

The coolest thing about Acrobat and PDFing documents is that it keeps the original scanned image and hides the OCR'd text underneath. It lets you highlight text, but all the formatting, images, and everything else stay put and are still visible.

If you find yourself doing this semi-regularly, it would probably be worth it to invest in a bulk scanner. I don't know what my friend has, but it was only $150 or so. It's worth saving the hassle when you can drop in 50 pages at a time and let them go.
posted by cschneid at 12:00 AM on December 6, 2006

Response by poster: Thank you very very much for the great suggestions so far.

Much much more often than not, I am only interested in a couple of pages here and there throughout the book, amounting to no more than a chapter or two.

If I want the whole book, I prefer to buy the ebook version. It is often cheaper and less time consuming.

Looks like a bulk scanner is the next major purchase on my shopping list :)
posted by cluelessguru at 7:47 AM on December 6, 2006

Response by poster: By the way, I have found the following link very useful:

posted by cluelessguru at 10:55 AM on December 11, 2006

Response by poster: It is true that Acrobat Pro (I am using version 7) can do OCR.

But I find that OmniPage Pro 14 is still much better.

Acrobat Pro 7 OCR managed to miss a couple of pages, while OmniPage Pro 14 did not fail.
posted by cluelessguru at 2:43 PM on December 14, 2006
