digitizing a book
September 15, 2020 12:04 PM

I want to digitize an old, rare book (into a text file/epub) without destroying it. I have an iPhone, an iPad, a computer with Windows 10 & Ubuntu, and a lot of time on my hands (so I don't mind physically taking a picture of the text on each page). I do not want to spend money on this. What's the best combination of programs to use to do OCR on each page and then combine all the OCR'd text into one document that I can then proofread/correct?
posted by needs more cowbell to Technology (15 answers total) 6 users marked this as a favorite
https://www.diybookscanner.org/ should be helpful.
posted by hankscorpio83 at 12:08 PM on September 15 [3 favorites]

I have a very cheap app on my phone, Genius Scan. It works fine, especially if you spend the time. There are probably many others that work just as well. Back in the day when I took pictures from books for my boss, I had a contraption with non-reflective glass and rubber bands and little wedges to keep pages in place without breaking the spine of the books. I don't remember exactly how it was made, but I'm sure you can re-invent it easily. The articles I scan now are not worth the effort.
posted by mumimor at 12:13 PM on September 15 [1 favorite]

Does Genius Scan combine the text from multiple pages into one document easily? I suppose that's the main thing I'm looking for--I've played with OCR apps but I want one that will do reasonably good OCR and also do the work of combining the text from each page into one document.
posted by needs more cowbell at 12:21 PM on September 15

I don't know your level of technical ability. I will note that I installed tesseract -- an open-source OCR engine whose development Google sponsored for years -- on my Mac (from the Unix command line), and its OCR accuracy is pretty damn amazing. But installing it was also a technically complicated process.
posted by metabaroque at 12:26 PM on September 15 [1 favorite]
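
For the "OCR each page, then combine" part of the question, here is a minimal Python sketch, assuming tesseract is already installed and on your PATH and that the book is in English. The names run_tesseract and ocr_book are made up for illustration; the only real tool involved is the tesseract command, called as "tesseract IMAGE stdout -l eng" so that it prints the recognized text instead of writing a file.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable


def run_tesseract(image: Path) -> str:
    """OCR one image by calling the tesseract command-line tool.

    Assumes tesseract is installed and on PATH, with English data ("eng").
    """
    # "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file next to the image.
    result = subprocess.run(
        ["tesseract", str(image), "stdout", "-l", "eng"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def ocr_book(
    images: Iterable[Path],
    ocr: Callable[[Path], str] = run_tesseract,
) -> str:
    """OCR each page image in the given order and join the results."""
    pages = []
    for number, image in enumerate(images, start=1):
        text = ocr(image).strip()
        # A marker line per page makes it easier to proofread
        # the text against the original photo.
        pages.append(f"[page {number}: {image.name}]\n{text}")
    return "\n\n".join(pages) + "\n"
```

If the photos were saved as, say, pages/page_001.jpg, pages/page_002.jpg and so on (a hypothetical layout), passing sorted(Path("pages").glob("*.jpg")) to ocr_book and writing the returned string to a file gives one text document to proofread. The zero-padded names matter, because a plain sort would otherwise put page_10 before page_2.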

If you just want the text, why not just transcribe?
posted by ocherdraco at 12:27 PM on September 15

Adobe Scan will OCR up to 25 pages per scan, which will be put together into one PDF.
posted by elsmith at 12:38 PM on September 15

The Google Drive app and the MS OneDrive app will take page pictures and join scans into an OCR'd PDF. The PDF may not be all that pretty, but the text should all be there.

Setting up some kind of a rig to ensure even lighting and a consistent camera position would be very helpful.
posted by scruss at 12:50 PM on September 15

Does Genius Scan combine the text from multiple pages into one document easily?

Yes it does.

But the reason I wrote about setting up the rig is that its letter recognition isn't good if you don't have flat pages. It doesn't matter much to me, because I only use it for finding things I already know are in the text, so I can search for different elements till I find the passage I want.

BTW checking the app I think the OCR only comes with the version you pay for. But it is not expensive.
posted by mumimor at 12:54 PM on September 15 [1 favorite]

If you just want the text, why not just transcribe?

I don't want to type 440 pages of text! I don't mind proofreading/correcting 400 pages of OCR'ed text but typing it out would probably re-awaken old carpal tunnel issues.
posted by needs more cowbell at 12:55 PM on September 15 [3 favorites]

Beware that if everything on the page isn't super-clear, the Adobe OCR may end up pretty bad -- for example, identifying blotches on the scan as random punctuation, and putting every few words in separate blocks. It can require a lot of touch-up.

A friend scanned a 40-page typed document about WWII, and I gave up after a LOT of manual clean-up. I mean, at some point I will return to the task, but it's a draaaaag.
posted by wenestvedt at 12:55 PM on September 15
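
Some of that clean-up can be automated before proofreading. As a rough sketch (not anything Adobe or any OCR tool provides; drop_noise_lines is a hypothetical helper), the function below throws away lines that contain no letters or digits at all, which is what a blotch read as stray punctuation tends to look like. It will also remove legitimate punctuation-only lines such as a "* * *" section break, so check the result against the page images.

```python
def drop_noise_lines(text: str) -> str:
    """Remove lines that contain no letters or digits (likely OCR specks)."""
    kept = []
    for line in text.splitlines():
        is_blank = not line.strip()
        has_alnum = any(ch.isalnum() for ch in line)
        # Keep blank lines (paragraph breaks) and any line with real content.
        if is_blank or has_alnum:
            kept.append(line)
    return "\n".join(kept)
```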

For the job of turning photos/scans of a book into a clean, properly-aligned PDF suitable for OCRing, I can definitely recommend ScanTailor. In a pinch I've used it with handheld photos and it's still done a pretty decent job.
posted by offog at 1:36 PM on September 15

This is probably less useful in covid-times, but many libraries have book scanners available to the general public.
posted by oceano at 2:11 PM on September 15

My take on the transcribe post above was different. Would it be possible to read the book aloud and have that converted to text? I'm thinking there are dictation apps that might fill the need.
posted by bhdad at 2:17 PM on September 15

Current librarian, former slide curator who spent a lot of time photographing images from magazines and books.

The ideal setup is a book scanner, of course. Barring that, experiment with lighting to get it lit as brightly and as evenly as possible. The more time you spend on this part getting it absolutely perfect, the less hassle it will be for you down the road. The best lighting will be two very, very strong (stronger than you expect!) lights, one to either side, at a 45-degree angle to minimize shadows from overhead, which is where your camera is. A tripod or other contraption that can hold your camera or phone flat directly above the book is significantly better than your hands. When you go to photograph the next page, move the book, not your camera.

You can purchase rigs like this, and I expect you can find instructions online to build them yourself.

Get your hands on a sheet of glass that's as nonreflective as possible. Cannibalizing a frame with glass like that is possible, or just phone up a local glass store and ask the price--it might be cheaper than you expect to get one slightly larger than your book's pages. Use that to hold down the page you are photographing to make it flat. You may need to get a (gentle!) weight or pad a binder clip to hold the other pages out of the way.

Get a sheet of black paper and put that underneath the page you are photographing. That will prevent text from the underside of the page showing through and, again, make the image as clear as possible. If the ink itself has soaked through to the other side you're out of luck, but use the black paper anyway to kill what you can of the show-through.

We weren't using OCR so I can't help you there, but if you're techie and you've got buddies or family members you can dragoon into helping you correct the text, search online for crowdsourced transcription software. Scribe is one such open source project that you can set up.
posted by telophase at 3:22 PM on September 15 [8 favorites]

The scanning program doesn't need to be the OCR application, which in turn doesn't need to be the editing and assembly program. It doesn't matter if the scanning program creates one-page or 400-page files, because you can assemble/rotate/edit the pages in the editing program. Epub authoring is a specific skill you might want to study. One good place is Lynda.com. An active discussion forum about ebook authoring is at MobileRead Forums.
posted by conrad53 at 4:48 PM on September 15 [1 favorite]
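
To illustrate keeping assembly separate from scanning and OCR, here is a small Python sketch of the assembly step on its own. It assumes whatever OCR tool you used left one .txt file per page in a folder, with the page number somewhere in each file name (page_1.txt, page_2.txt, ...). That layout and the names page_number and assemble are assumptions for illustration only.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Return the last run of digits in the file name, e.g. page_12.txt -> 12."""
    digits = re.findall(r"\d+", path.stem)
    return int(digits[-1]) if digits else 0


def assemble(page_dir: Path, output: Path) -> int:
    """Join every .txt file in page_dir, in page order, into one file.

    Returns the number of pages written.
    """
    pages = sorted(
        # Skip the output file itself in case it lives in the same folder.
        (p for p in page_dir.glob("*.txt") if p.resolve() != output.resolve()),
        key=page_number,
    )
    with output.open("w", encoding="utf-8") as out:
        for page in pages:
            # Normalize trailing whitespace so pages are separated
            # by exactly one blank line.
            out.write(page.read_text(encoding="utf-8").rstrip() + "\n\n")
    return len(pages)
```

Sorting on the number rather than the raw file name means page_10.txt lands after page_2.txt even without zero-padding. The combined file can then be opened in whatever editor or epub tool you prefer for proofreading.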
