Digital Photography Basics and OCR
March 24, 2007 10:24 AM Subscribe
Figuring out digital photography to support an OCR project
My work just bought a new scanner that uses two SLR cameras to do the imaging. I've been charged with setting up the system for using this scanner to digitize and then OCR a book collection. I am a total novice about digital photography, so I am looking for a good site/book/paragraph to explain the basic details that I need to know (like lighting, lenses, camera speed etc) for setting up the cameras. Secondly, I am also trying to figure out how to set up the dimensions and formats for the images in order to produce an optimal PDF that can go through the OCR process. At present, the cameras are outputting the images as JPEGs, each about 3MB/page and roughly 3000 x 2000 pixels. I know that these dimensions are too large for a letter sized PDF, so I'll need to resize them in a batch process in Photoshop to the correct dimensions without losing clarity. The minimum dpi for OCR is about 350, so I'm wondering how to ensure I hit that threshold after the resizing.
Via the 'Content Provision' section of the forums at pgdp.net (you need to be a member to read them, I think), I found this which is a very basic starting point that might be helpful to you?
I don't have any OCR software myself, but when I scanned a project for PG, they wanted jpegs (B&W, 300 dpi) rather than pdfs. Are you sure you need to convert the files? Seems like an extra level of complication to me.
posted by Lebannen at 11:46 AM on March 24, 2007
Edit: I'm an idiot. I had png files, not jpegs, except for the illustrations.
posted by Lebannen at 11:49 AM on March 24, 2007
Response by poster: Thanks - I should have mentioned that I'm already using ABBYY (the Tips & Tricks for Digital Photography is something I overlooked). The scanner itself is from Atiz - you're basically buying a frame for the cameras, book cradle, and the software that controls the cameras and manages the images.
Unfortunately the final product has to be a full text searchable PDF, so there's no way of getting around the conversion process. I've had middling success in the past with batch converting 600 dpi TIFFs into PDFs, where we lost some image quality and resolution.
posted by gov_moonbeam at 12:36 PM on March 24, 2007
There's a couple of ways of converting to PDF: by OCR or just by including raster images. Since you need a text-searchable document, you have to go the OCR route: you can't search an image for text. Given that you have to OCR and the minimum resolution for OCR is 350dpi, why do you want to scale the images down before OCRing? It's not like there's any such thing as "too big" for the OCR engine, is there?
Once it's been through OCR, all you have is text and layout; the original image is irrelevant so its resolution is moot as long as it's enough for the OCR to work... unless the book has illustrations, in which case you should keep as much resolution as possible.
Anyway, if these are 5x8" pages, you'll get only about 400dpi at best (2000 pixels across 5 inches; the 8-inch side across 3000 pixels gives 375dpi). So even if you shrank it to 350dpi, you'd be reducing the size by at most 12.5% in each dimension, roughly a 23% saving in area, which is practically pointless. I say: don't resize the JPEGs, leave 'em as is and just crop to the page size. As for batch conversions, ImageMagick is your friend.
If the pages you're photographing are 8.5x11" then you're only getting 235dpi from the camera. Definitely do not shrink the images in that case, though I'm sure it'll be enough resolution for the OCR to work.
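The dpi figures above come from dividing the pixel count by the page size in inches. Here is a minimal Python sketch of that arithmetic, assuming the page fills the frame in its best orientation; the function names are made up for illustration. Because the tighter side sets the limit, the 5x8" case comes out at 375 dpi, a little under the round 400 figure.

```python
def effective_dpi(pixels_w, pixels_h, page_w_in, page_h_in):
    """Best-case dpi when the whole page just fits in the frame.

    Tries both orientations and returns the better one; within each
    orientation the tighter side limits the result.
    """
    upright = min(pixels_w / page_w_in, pixels_h / page_h_in)
    rotated = min(pixels_w / page_h_in, pixels_h / page_w_in)
    return max(upright, rotated)


def pixels_needed(page_w_in, page_h_in, dpi):
    """Pixel dimensions required to cover a page at the given dpi."""
    return round(page_w_in * dpi), round(page_h_in * dpi)


def area_saving(from_dpi, to_dpi):
    """Fraction of pixels removed when downsampling from one dpi to another."""
    return 1 - (to_dpi / from_dpi) ** 2


# 3000 x 2000 pixel frame, as quoted in the question
for page in [(5, 8), (8.5, 11)]:
    dpi = effective_dpi(3000, 2000, *page)
    print(f'{page[0]}x{page[1]}" page: about {dpi:.0f} dpi')

print("Letter page at 350 dpi needs", pixels_needed(8.5, 11, 350), "pixels")
print(f"400 -> 350 dpi removes {area_saving(400, 350):.1%} of the pixels")
```

By this estimate a letter-sized page at 350 dpi needs more pixels than the cameras deliver, so resizing can only lower the resolution.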
As for camera setup... do you have the lenses already? You want to get a prime (not zoom) macro lens; either 50mm or 100mm will be fine. The latter will just require that the camera is about 2x as far from the page, and the 50mm will be cheaper. Set the aperture to about 2 stops smaller than maximum, since this is usually where you get maximum sharpness; e.g. if the max aperture is f/2.8 (highly likely), you want to select f/5.6 (make sure the camera is in aperture-priority or manual mode). Make sure the camera is centred above the page, not at an angle, and ensure that the page is perfectly flat.
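The "2 stops smaller" rule can be worked out for any lens: each full stop halves the light and multiplies the f-number by the square root of 2. This is a small sketch with a hypothetical helper name. Real lenses mark rounded values, so treat the result as the nearest marked setting.

```python
import math


def stop_down(f_number, stops):
    """f-number after closing the aperture by `stops` full stops.

    Each full stop halves the light, which multiplies the f-number
    by sqrt(2).
    """
    return f_number * math.sqrt(2) ** stops


for max_aperture in (2.8, 4.0):
    print(f"f/{max_aperture} stopped down 2 stops: "
          f"f/{stop_down(max_aperture, 2):.1f}")
```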
Use two off-camera flashes, each pointing at the page with 45 degrees elevation, one from the left and one from the right. Exposure time will be the "X-Sync" speed of the camera, i.e. the maximum shutter speed wherein both curtains are fully open at the same time, i.e. the shutter is not acting as a moving slit. The camera should automatically control the flash power to give you a good exposure; if it over-exposes even with the flashes on minimum power then move them further away from the page to reduce the amount of light. Use the minimum ISO the camera supports, probably 80 or 100.
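To estimate how far to move the flashes, here is a rough sketch based on the inverse-square law. It assumes a bare flash that behaves like a point source; bounced or diffused light falls off more slowly, so use the numbers only as a starting point. The helper names are illustrative.

```python
import math


def stops_lost(old_distance, new_distance):
    """Stops of light lost when the flash moves from old to new distance.

    Assumes inverse-square falloff: doubling the distance quarters
    the light, which is 2 stops.
    """
    return 2 * math.log2(new_distance / old_distance)


def distance_for_stops(old_distance, stops):
    """Distance at which the flash is `stops` stops dimmer on the page."""
    return old_distance * 2 ** (stops / 2)


print(f"1.0 m -> 2.0 m loses {stops_lost(1.0, 2.0):.1f} stops")
print(f"To lose 1 stop from 1.0 m, move to {distance_for_stops(1.0, 1):.2f} m")
```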
If you can, surround the whole rig with bright white cardboard to diffuse the light from the flashes. A good thing to do is provide diffuse indirect light by bouncing each flash off a 20x30" piece of white card so that the light is no longer a point-source. That way you could get away with a single flash maybe.
If you want a high scanning rate, you'll want mains-powered flashes for the fast recycle time and no changing of batteries. Another option is to go for bright incandescent lights; in fact that'd likely be cheaper. In that case (hot lights), the exposure is defined by the shutter speed; the camera should make a good guess at it (might need +1 exposure compensation because of the white page) or you can get it right by experimenting in manual mode.
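With hot lights and a fixed aperture and ISO, +1 exposure compensation simply doubles the shutter time. A tiny sketch of that relationship follows; the metered 1/60 s is an assumed example value, not a recommendation.

```python
from fractions import Fraction


def compensate_shutter(metered_time, ev):
    """Shutter time after `ev` stops of exposure compensation.

    Aperture and ISO are assumed fixed, so each +1 EV doubles the time
    and each -1 EV halves it.
    """
    return metered_time * Fraction(2) ** ev


metered = Fraction(1, 60)  # assumed meter reading, in seconds
print("metered:", metered, "s")
print("+1 EV:  ", compensate_shutter(metered, 1), "s")
```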
posted by polyglot at 6:09 PM on March 24, 2007 [2 favorites]
One addition to polyglot's (excellent) advice. You might want to think about using fluorescent photo lights in lieu of incandescents, if your documents are at all valuable or rare. A lot of things may not enjoy having quite as much energy as photographic hotlights put out, dumped onto them, particularly the infrared heat.
Since with OCR you don't really care too much about color reproduction, you might be able to get away with fairly cheap fluorescent bulbs (although they make pretty good ones if you want to spend money).
I use fluorescent lights on my (very DIY) macro rig, and they are much, much nicer to work around than my old 3200K tungsten setup that I had when I was using film.
posted by Kadin2048 at 12:17 AM on March 25, 2007
I've scanned a lot of books. This is using consumer-grade equipment/programs. My workflow was always: scan to TIFF, batch import TIFFs to OmniPage, OCR, convert to PDF. OmniPage can be set up to do it all automatically (basically one-click when you get it set up right).
Just out of curiosity, could you post a link to the scanner? I've never seen one that uses SLR cameras for scanning.
posted by i_am_a_Jedi at 11:43 AM on March 24, 2007