Scanning Academic Journals
July 29, 2008 12:43 PM

I'm looking for advice/links on suggested quality/settings for the scanning of academic journals for archiving. The more detail the better.

Looking at scanning a lot of academic journals (containing text, diagrams and images) using a professional service (which has been found).

Obviously only want to do this once, so looking for advice on suggested resolutions, file formats, greyscale or colour etc. Essentially this needs to be archival/library quality, that could be turned into other formats (searchable .pdfs or .txt files) as appropriate (any thoughts on advantages of each of those also welcome).
posted by drill_here_fore_seismics to Technology (4 answers total)

Having done something similar I can say: if you can afford it do full-color TIFFs (but not with embedded compressed JPEGs) at 600DPI. Those never get old. Get them all scanned on the same fancy scanner if you can.

Some other random notes:

If you cut the spines of the volumes you'll save a ridiculous amount of money by sheet feeding and will get better scans because you won't have gutter shadow, but be CAREFUL. One mistake means a lot of hand-scanning (assuming you have duplicates available). It's worth cutting books by hand, page by page, if they're tightly bound.

The Fujitsu scanners (5150c is highly recommended) deal well with thin or old paper stocks; very little tearing (one out of 5,000 pages or so would need a little Scotch tape). I swear by them. (Although I guess you have a service bureau for that.) Make sure the vendor shows you lots of samples early on.

If storage space is an issue (TIFFs are huge) I'd stay with color and use very lightly compressed JPEG (quality 90% or so) at 600DPI instead of TIFF. Some people would find that advice horrifying--choosy librarians prefer TIFF--but at that compression and resolution I can see the grain of the paper and the offset printing artifacts at 100% zoom, while JPEG compression artifacts are few and far between. (JPEG2000 is even better in this regard but is slooooow. Worth investigating.) That is not the conventional wisdom, though, which calls for zillion-bit TIFFs.
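
To put rough numbers on "TIFFs are huge," here is a back-of-the-envelope sketch of uncompressed size per page. The US Letter page size is an assumption (journal trim sizes vary), and the function names are made up for illustration; real files will differ once headers and compression come into play.

```python
# Rough uncompressed size of one scanned page.
# Assumption: a US Letter page (8.5 x 11 in); real journal trim sizes vary.

def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the raw pixel-data size in bytes, ignoring file headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


def page_sizes_mb(width_in=8.5, height_in=11.0, dpi=600):
    """Sizes in megabytes (10**6 bytes) for common bit depths."""
    depths = {"color_24bit": 24, "gray_8bit": 8, "bitonal_1bit": 1}
    return {
        name: uncompressed_bytes(width_in, height_in, dpi, bits) / 1e6
        for name, bits in depths.items()
    }


if __name__ == "__main__":
    for name, mb in page_sizes_mb().items():
        print(f"{name}: {mb:.1f} MB per page")
```

Under those assumptions a 600DPI color page is about 101 MB uncompressed and a grayscale page about a third of that, so multiply by your page count before committing to a format.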

Once the images are acquired, store them, then make copies downsampled to 300DPI; otherwise everything takes forever even on fast hardware. Batch the downsampled images into an OCR program of your choosing. (OmniPage: best overall but IMPOSSIBLY broken on big batches because of memory leaks; ABBYY: great OCR, nasty deskew; Readiris: okay OCR, great deskew; there are expensive commercial batchers as well.) PCs are where it's at for OCR, not Macs. Save as image-on-text, i.e. the page image layered over the searchable OCR text.
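
One possible way around OCR tools that choke on big batches is to feed them small fixed-size chunks and restart the program between chunks. This is only a sketch of the chunking step; the file names and the batch size are hypothetical, and how you hand each chunk to the OCR program depends entirely on the tool.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Hypothetical file names; substitute the real list of 300DPI copies.
    pages = [f"vol01_page{n:04d}.tif" for n in range(1, 8)]
    for number, chunk in enumerate(batches(pages, 3), start=1):
        print(f"batch {number}: {chunk[0]} .. {chunk[-1]} ({len(chunk)} files)")
```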

Once you have OCRed pages you can associate pages with bibliographic data, extract deskewed images for web-based viewing or extract text programmatically to stow it into your search engine.

If you don't have bibliographic data in a digital form there are a lot of other problems to solve that are out of scope here. If you do, be on the lookout for exciting or interesting pagination issues as you line up the journals, especially in older volumes--unnumbered sequences, starred repeats of certain pages bound in after publication, and so forth. QA is extremely important, although readers given an easy feedback form will find the many edge cases that emerge--ripped pages, blow-in cards, etc. How much QA you do depends on your budget, of course.
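
A first-pass pagination check can be automated before a human looks at anything. The sketch below assumes you already have the printed page label for each scan, in scan order, as a string (empty for unnumbered pages, "2*" for a starred repeat); that input format and the function name are assumptions for illustration, not something any particular tool produces.

```python
def check_pagination(labels):
    """Flag pagination oddities in a list of page labels given in scan order.

    Returns a list of (scan_position, label, problem) tuples.
    """
    problems = []
    seen = set()
    expected = None
    for position, label in enumerate(labels, start=1):
        if not label.isdecimal():
            # Covers blank labels, roman numerals, starred repeats, etc.
            problems.append((position, label, "unnumbered or non-numeric"))
            continue
        number = int(label)
        if number in seen:
            problems.append((position, label, "repeat"))
        elif expected is not None and number != expected:
            problems.append((position, label, f"expected {expected}"))
        seen.add(number)
        expected = number + 1
    return problems


if __name__ == "__main__":
    for problem in check_pagination(["1", "2", "2*", "", "5", "5"]):
        print(problem)
```

Anything it flags still needs eyes on the page; it only narrows down where to look.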

If there's limited financial return expected on your side don't be afraid to talk to people at JSTOR. It's possible they'll do it for you and let you have a copy for your own purposes.

Get used to doing a lot of simple math for time estimates. I was dealing with a quarter-million page archive mostly on my own. No problem! I'll just give each page a minute of my time. Except that's about 520 eight-hour workdays. Etc., etc. At that scale you need to estimate human time, CPU time, etc., in a different way than on many other projects. People who do film/3D work are good at helping you think through those sorts of problems if you have any around.
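
The minute-per-page arithmetic is worth keeping in a little script so you can rerun it as your estimates change. A minimal sketch; the 10 seconds of CPU per page and the single worker are made-up placeholders, not measurements.

```python
def human_workdays(pages, minutes_per_page, hours_per_day=8):
    """Workdays needed if a person spends `minutes_per_page` on every page."""
    return pages * minutes_per_page / (hours_per_day * 60)


def cpu_days(pages, seconds_per_page, workers=1):
    """Wall-clock days of round-the-clock processing across `workers` machines."""
    return pages * seconds_per_page / workers / 86_400


if __name__ == "__main__":
    pages = 250_000
    print(f"human: {human_workdays(pages, 1):.0f} workdays at 1 minute per page")
    # Placeholder figure: 10 seconds of OCR time per page on one machine.
    print(f"cpu:   {cpu_days(pages, 10):.1f} days at 10 seconds per page")
```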

It's very cool to watch digital archives emerge. Enjoy!
posted by ftrain at 2:51 PM on July 29, 2008

Thanks for the information so far, especially ftrain's detailed post. Any thoughts on whether greyscale is acceptable for text/diagrams, or if full colour is really the way to go would be appreciated. Also the scanning service is offering to supply technical metadata with the images - worth doing?
posted by drill_here_fore_seismics at 10:38 AM on July 30, 2008

Technical metadata (to me) means "scanned at X DPI by scanner type Y at hour Z using color settings blah blah." It should really just come with the scans unless they're doing something special; i.e. it shouldn't be pricey. But they might mean something else by the term. Ask them to explain what they mean and how that data is most often used.

Grayscale is fine if the material is all black and white. Makes managing levels much easier for final conversion. I was scanning old issues of an art/lit magazine so color was incredibly important and I'm glad I did it that way (as are the readers). But if you can't imagine a scenario where color would provide valuable information to the reader then go grayscale. You'll save a ton of space and time.

1-bit bitonal I'd avoid unless you're only interested in OCRing.
posted by ftrain at 3:51 PM on July 30, 2008
