Scanning many old, thin pages
October 5, 2012 7:55 PM

I have inherited 1,000+ pages of my grandmother's writings. I would like to scan them, OCR them and (after fixing OCR mistakes) share them with the rest of my family online. My question is this: what's the best way to scan so many pages? Also, I should point out that many of the pages are on thin typing paper. Maybe this is carbon paper, or onion skin paper? I'm not sure, but I don't want to damage the originals.

To add more detail: once the documents are scanned, I know how to do the OCR, how to create digital documents, etc. What would be most helpful is a process that gets the pages scanned as quickly and cheaply as possible, without harming them.
posted by dylan_k to Technology (15 answers total) 5 users marked this as a favorite
I bet you're looking for a technological answer. Instead I have a human behavior suggestion: create a blog and scan a small bit at a time, once a week. Could be a letter or story or chapter, whatever makes sense based on the nature of her writings. Your family might appreciate the trickle of them, and it should also make it feel less daunting for you.
posted by nadise at 8:20 PM on October 5, 2012 [4 favorites]

One more suggestion. If it were my grandmother, I would love to see scans of her handwritten pages, rather than just a typed-up version. Of course I don't know how easy her handwriting is to read! To build on nadise's excellent suggestion, maybe the blog posts could feature a scan of the original, plus the OCR "translation" for easy reading if people find her handwriting difficult.
posted by Joh at 9:14 PM on October 5, 2012

I'd recommend a document scanning service. The time taken to convert 1000+ pages to images will be substantial. One such service has pricing online, which may give you an idea of how much it would cost.

• $0.04 per page: Basic Scan to either PDF or TIFF images, 300 x 300 dpi b/w
• $0.05 per page: Content Search: above plus Full Back-up of searchable PDF
• $0.06 per page: Auto File Name Indexing: all above plus Auto Indexing and creating of searchable PDF
• $0.07 per page: Custom File Name: all above plus Indexing on Forms recognition, creation of searchable PDF, validation of indexing fields
• $0.08 per page: Full Document Indexing with Database validation: all above plus Database validation and automatic indexing with ODBC database connection

I've never even heard of this company until I did a search, but $40 to scan 1000 pages to PDF sounds like a good investment of money versus time. I'm sure there are many other services available, including some in your area.
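
For anyone who wants to compare the tiers, here is a rough sketch of the arithmetic in Python. The per-page rates are the ones quoted in the list above; the shortened tier labels and the job_cost helper are made up for illustration, and a real quote may include setup or shipping fees that are not modeled here.

```python
# Per-page rates (USD) as quoted in the list above.
# The tier labels are shortened, illustrative names, not the service's own.
RATES = {
    "basic_scan": 0.04,
    "content_search": 0.05,
    "auto_file_name_indexing": 0.06,
    "custom_file_name": 0.07,
    "full_document_indexing": 0.08,
}


def job_cost(pages: int, rate_per_page: float) -> float:
    """Return the total cost in dollars, rounded to the nearest cent."""
    return round(pages * rate_per_page, 2)


if __name__ == "__main__":
    # Print the cost of a 1,000-page job at every tier.
    for tier, rate in RATES.items():
        print(f"{tier}: ${job_cost(1000, rate):.2f}")
```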
posted by blob at 9:56 PM on October 5, 2012 [1 favorite]

Be aware that you don't really need a scanner in order to scan a document - a camera will do just as well. It may be quicker and it is more forgiving if you don't want to damage the originals. This also goes for cameras in newer smartphones. There are various apps available that help optimise images and then either add them as pages to a PDF or make them available for OCR.
posted by rongorongo at 12:24 AM on October 6, 2012

Somehow I see Joh's answer and Dylan's question as bypassing each other. If the pages are handwritten, there can be no OCR. No way.
posted by megatherium at 2:34 AM on October 6, 2012

You have 3 tasks . . .

Copy originals onto regular paper*
Scan regular paper
Transcribe the data (OCR is probably not possible as megatherium says)

*If the paper is nearly translucent, it is probably onionskin, and I would think twice about auto-feeding it through a copier. You may be better off copying one page at a time manually. (Lift lid, lay paper, lower lid, hit copy button, lift lid, replace paper, etc.)

The copier I use at work can scan documents into PDFs and email them, but unless you have access to one of these, you'll then have to scan the copies into PDFs.

You should actually work from copies anyway so you don't risk your originals.

Why not make this a family project? Ask if anyone in the family would be interested in helping, make the copies and then make a second set of copies. Distribute hunks of the work to the group; everyone transcribes their bits and sends them back to you for the online database.

(If no one will help, you're better off doing weekly chunks, but again, work from copies if you have to transcribe so you can make notes as you go. My grandmother's M and N looked very similar in cursive. A post earlier this week cleared up something that I always wondered about. I saw the word "fornication" notated several times in one of her old diaries/calendars, usually shortened to "forn". I always wondered, but it wasn't like I could ask "Oh um, why was Granny marking down when she had sex?"
Based on what I know of her medical history, she was writing "formication". I am soooooooooo glad I never asked.)
posted by jaimystery at 5:13 AM on October 6, 2012 [1 favorite]

One possibility is a simplified version of a book scanner.

Basically, you'll need a camera with a decent low-light lens and ideally some kind of remote control, a tripod, and light. Rig the camera to point at a table so that the lens is pointing straight on at the page (to minimize distortion). Get as much light as possible onto the page and take the picture with a fast shutter speed to maximize sharpness. The remote comes in handy here because then you minimize the possibility of the camera moving while you're taking this low-light shot; if none is available, use the time delay function of the camera so that your hands are away from the camera when the shot is taken.

It's a little tricky to get set up and will require some experimentation, but once it is set up, you should be able to rapidly "scan" each of the pages of your original text--throw down a page, click the button, next page.
posted by JDHarper at 5:17 AM on October 6, 2012

Brown University Library recently launched Curio, a blog where their Digital Production Services staff talk about imaging rare and unusual artifacts that present challenges for digitization. You could call them or email one of the staff members. The second visible post is about copying something that was on thin India paper. I'm sure they could help you figure out a process, and the best thing is that you know they are knowledgeable on the subject. Good luck!
posted by cashman at 7:07 AM on October 6, 2012

I would look into the DIY bookscanning route with a camera, a copy stand, and maybe a piece of glass to flatten pages that need it. A little time googling, starting with "DIY bookscanner", should get you off to a good start.

Once you have your rig in place you should be able to go through pages at a surprising pace.
posted by Good Brain at 8:35 AM on October 6, 2012

If I were doing a project like that, I would use a flatbed scanner (because the book scanner I have access to uses a book cradle and is almost useless for single sheets). The advantage of this would be that the pages are cropped during the scanning stage.

If the documents are on such thin paper, or are faded or written in pencil, you will likely need to monkey with the contrast, gamma, and color balance to make them legible or even visible. The human eye is far more forgiving than any imaging system. This will result in an image that may look nothing like the original, so you will probably also want to scan a copy with the color settings flat; that way you have both what the document looks like and what it says. Each page may very well need individual color/contrast adjustment.
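
To give a rough idea of what the contrast and gamma adjustments do to the numbers, here is a minimal sketch in plain Python that works on a list of 8-bit grey levels rather than a real image file. The function names, the sample values, and the exponent convention (out = 255 * (p / 255) ** exponent) are my own assumptions for illustration; scanning and photo software may define gamma the other way around, so check how your tool labels it.

```python
def stretch_contrast(pixels, low, high):
    """Linearly map grey levels in [low, high] onto [0, 255], clipping the rest."""
    if high <= low:
        raise ValueError("high must be greater than low")
    scale = 255.0 / (high - low)
    return [min(255, max(0, round((p - low) * scale))) for p in pixels]


def apply_gamma(pixels, exponent):
    """Apply out = 255 * (p / 255) ** exponent to each 8-bit grey level.

    With this convention an exponent above 1 darkens mid-tones and an
    exponent below 1 lightens them; pure black and pure white stay put.
    """
    if exponent <= 0:
        raise ValueError("exponent must be positive")
    return [round(255 * (p / 255) ** exponent) for p in pixels]


if __name__ == "__main__":
    # Hypothetical sample: faint grey type (about 150) on a light page (about 210).
    sample = [150, 180, 210]
    stretched = stretch_contrast(sample, low=140, high=220)
    print(stretched)
    print(apply_gamma(stretched, exponent=1.5))
```

Stretching first and then darkening the mid-tones pushes faint type toward black while the page stays light, which is usually what OCR wants. It also shows why a second, unadjusted scan is worth keeping.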

There's also no guarantee that even the adjusted image will be OCRable. (And as megatherium mentioned, handwritten documents cannot be OCR'd.) Even if it is, OCR is rarely 100% accurate. You might want to consider making this a family-wide project, crowdsourcing OCR correction and transcription of unOCRable pages out to them, then inserting the resulting text into whatever program you're using for your OCR.

Doing it yourself will take a long time, but it might be worth it. Hiring a service will be much faster and cheaper, but you'll get what you get.
posted by Devoidoid at 10:19 AM on October 6, 2012

Agree, a full-bore book scanning rig is probably the wrong tool for the job. Really all that's needed is a camera, some sort of copy stand with appropriate lighting, and a sheet of glass to flatten sheets that need flattening. I suggested starting research by looking at DIY book scanning sites because I think that general approach is going to be more flexible and productive than most of the previous suggestions, and because the technical considerations (including aspects of the post-processing) of this project would be a subset of things people doing book scanning have had to deal with.
posted by Good Brain at 11:45 AM on October 6, 2012

Just a note: If the writing is on onion-skin paper (or other thin paper), put a sheet of regular white copy paper behind the onion-skin paper before you scan it or copy it - or even photograph it. That should make a much better copy without the need for a lot of editing afterward.
posted by aryma at 10:07 PM on October 6, 2012 [1 favorite]

Response by poster: I should add that all the pages were typed, not handwritten, thus the typing paper.

I might also add that I'm trying to avoid photographing or flatbed-scanning each and every page out of a set of nearly 1,000. Then again, the fragility of the paper might require this?
posted by dylan_k at 10:41 AM on October 11, 2012

Response by poster: *I mean, manually scanning/photographing, by myself, that is.
posted by dylan_k at 10:49 AM on October 11, 2012

Response by poster: I just wanted to thank everybody here for their help with this project. If you're curious, I've started blogging about the project. Thanks again!
posted by dylan_k at 11:28 AM on January 6, 2013
