Getting Photos Ready for OCR
February 9, 2022 3:35 PM

What is the optimal way to process photos of text pages for OCR? The last set of questions on this process is from more than five years ago, and it seems to be a pretty niche area judging by my YouTube and Google searches.

I am helping a friend with a project. Because of lockdown, access to archives was minimal, so my friend took photos of the relevant texts. Unfortunately, my friend's relationship with technology is terrible. So I have many old iPhone pictures of book pages at 72 DPI.

Here is what I have access to do post-processing:

* Photoshop 2022
* ABBYY Finereader on the Mac (both new and old versions)
* Adobe Acrobat

To maximize OCR quality, what is the best way to process photos of sepia-colored book pages (these texts are old)?

Complicating factors:
1) ABBYY seems to choke on italics;
2) The OCR'd languages are 19th and early 20th century Italian and ancient Roman Latin;
3) Playing around with the files, I get readable text BUT crap OCR, to the point that I am retyping a lot of the material by hand to correct it.

So, how would you go about taking a photo of a book and getting very good OCR out of it?

posted by jadepearl to Computers & Internet (11 answers total) 5 users marked this as a favorite

Start by throwing this stuff into Tesseract and seeing what comes out.

Tragically you always need to check this stuff yourself, but Tesseract is free and about as good as it gets.
posted by mhoye at 3:55 PM on February 9, 2022 [2 favorites]

There are a couple document management systems that have put some effort into optimizing the pipeline. These all use tesseract for the actual OCR part, but they all have preprocessing steps which attempt to produce the best input for tesseract.

OpenPaper.work

Paperless

Paperless-NG
posted by RonButNotStupid at 4:14 PM on February 9, 2022 [2 favorites]

> book pages at 72 DPI

Is that the equivalent page resolution, or just another Photoshop-ish term for "no particular resolution"?

You'll need above 150 dpi to get anything useful. It's not really worth trying anything lower. Also, make sure that Tesseract knows about the language you're working in, or it'll assume every é is a 6.
posted by scruss at 5:42 PM on February 9, 2022 [3 favorites]

Have you tried the hacky OCR Using Google Docs approach? Link goes to some ad-ridden site but the general process is solid and sometimes works.
posted by jessamyn at 7:43 PM on February 9, 2022

I've not had issues with ABBYY before. You won't get better than that or Tesseract.
posted by turkeyphant at 9:47 PM on February 9, 2022

OK, here is the process so far:

1) Batch process the janky photos: rotate, resample and upscale to 266-300 DPI, and create another layer in Photoshop;
2) Do further voodoo in Photoshop: use curves to whiten the background and darken the pigment of the text, add some sharpening, and, depending on how curved the text is, apply the transform-warp function;
3) Output to a high-quality TIFF file;
4) Test ABBYY against Tesseract. So far, ABBYY FineReader does a better job of deskewing and straightening curved text. Right now, both of them seem to have issues with serif italics.

I am trying to figure out how to get Tesseract to OCR a folder of files. Right now, I am doing it one file at a time and then joining the results with the cat command.

I am using an M1 iMac, and unfortunately none of the macOS third-party GUIs/front ends work on my machine. Yes, that includes the Java runtimes, too.
posted by jadepearl at 12:37 AM on February 10, 2022

If you're comfortable opening up the Terminal, some simple shell scripting may help here:

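for file in *.tiff; do tesseract "$file" "${file%.*}"; done

This will go through all the *.tiff files in the current directory (in the order the shell sorts them) and run "tesseract myfile.tiff myfile" on each of them. Quoting "$file" keeps filenames with spaces intact. If your files end in .tif, change the pattern to *.tif.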
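If the loop needs to grow (for example, to pass a language to Tesseract, as suggested earlier in the thread), a small Python script may be easier to extend. This is only a sketch: `build_commands` and `run_all` are made-up names, and the `ita+lat` language setting assumes the Italian and Latin Tesseract language data are installed on your machine.

```python
import subprocess
from pathlib import Path


def build_commands(folder, pattern="*.tiff", languages="ita+lat"):
    """Return one tesseract command (as an argument list) per matching image."""
    commands = []
    for image in sorted(Path(folder).glob(pattern)):
        # The output base is the image path without its extension;
        # tesseract adds ".txt" to it on its own.
        output_base = image.with_suffix("")
        commands.append(
            ["tesseract", str(image), str(output_base), "-l", languages]
        )
    return commands


def run_all(folder, pattern="*.tiff", languages="ita+lat"):
    """Run tesseract on every matching image, stopping at the first failure."""
    for command in build_commands(folder, pattern, languages):
        subprocess.run(command, check=True)
```

Calling `run_all(".")` from the folder of images should leave one .txt file next to each image. Printing the result of `build_commands(".")` first is a cheap way to check what would be run.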

Keep in mind the way filenames are sorted.

file1.tiff, file2.tiff ... file10.tiff will be sorted very differently than file01.tiff, file02.tiff, ... file10.tiff.
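If renaming the files with leading zeros is not practical, sorting by the number inside the name is another option. A minimal Python sketch, where `natural_key` is a hypothetical helper name:

```python
import re


def natural_key(filename):
    """Split a name into text and number chunks so 'file2' sorts before 'file10'."""
    parts = re.split(r"(\d+)", filename)
    # Digit runs become integers and are compared as numbers;
    # everything else is compared as lowercase text.
    return [int(part) if part.isdigit() else part.lower() for part in parts]


names = ["file10.tiff", "file1.tiff", "file2.tiff"]
print(sorted(names))                   # ['file1.tiff', 'file10.tiff', 'file2.tiff']
print(sorted(names, key=natural_key))  # ['file1.tiff', 'file2.tiff', 'file10.tiff']
```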
posted by RonButNotStupid at 4:35 AM on February 10, 2022

I do document imaging for a living -- some suggestions:

DPI needs to be something high; we prefer 300 dpi. But as others said, you need to do 'pixel math' if the file is reporting 72 dpi: the number of pixels wide, divided by the estimated width of the original in inches, gets you the dpi. For example, if your image says "72 dpi, 35 inches wide", it is 2520 pixels wide, which is actually about 300 dpi for an 8.5-inch-wide page. I don't know of any scanner or camera that would produce something at 72 dpi, so you may need to revise what you think the resolution actually is. If the image is already 300 dpi after doing the math, but you're doing a bunch of processing to "make" it 300 dpi, you may be adding noise that hurts things later. If you have TIFF images, it's possible to just change the dpi tags without modifying the image. As long as you have the right number of pixels, the reported DPI is irrelevant in most cases.
Three things you should consider applying to your images: threshold, dilation, and erosion. Threshold turns your image into either black pixels or white pixels, nothing in between, so the OCR can get a good bead on what's text and what isn't. Dilation adds pixels around the edge of the text to make it heavier; erosion takes pixels away around the edges of characters to make them lighter. Depending on the quality of the image, use erosion and dilation as needed, but OCR loves a thresholded 1-bit color depth image.
There are what seems like a million command line utilities which can automate these processes. ImageMagick is a big Swiss Army knife of image processing, and there's libtiff, which I use a lot because TIFFs are just easier to work with. Mixing and matching utilities in batch files and letting them run for days used to be how I spent my time. Unfortunately I am not familiar with doing these on Macs.
I've had much better success with Tesseract when OCRing, but ymmv.
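To make threshold, dilation, and erosion concrete, here is a toy Python sketch that works on a grid of grayscale values (0 is black, 255 is white). It is not how ImageMagick or any other tool implements these steps; the function names, the cutoff of 128, and the 3x3 neighbourhood are all assumptions chosen for illustration.

```python
def threshold(gray, cutoff=128):
    """Mark pixels darker than the cutoff as ink (1) and the rest as paper (0)."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]


def _neighbourhood(binary, y, x):
    """Yield the 3x3 block around (y, x), skipping positions off the image."""
    height, width = len(binary), len(binary[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                yield binary[ny][nx]


def dilate(binary):
    """Thicken strokes: a pixel becomes ink if any pixel in its 3x3 block is ink."""
    return [[1 if any(_neighbourhood(binary, y, x)) else 0
             for x in range(len(row))] for y, row in enumerate(binary)]


def erode(binary):
    """Thin strokes: a pixel stays ink only if its whole 3x3 block is ink."""
    return [[1 if all(_neighbourhood(binary, y, x)) else 0
             for x in range(len(row))] for y, row in enumerate(binary)]


# A 5x5 patch of light paper (200) with one dark speck (60) in the middle.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 60

ink = threshold(page)   # one ink pixel
heavier = dilate(ink)   # grows to a 3x3 block
lighter = erode(heavier)  # shrinks back to the single pixel
```

On real sepia pages the cutoff would need tuning, since paper and ink are closer together than in this toy patch.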

posted by AzraelBrown at 6:28 AM on February 10, 2022 [4 favorites]

To answer the questions about what I am seeing: one example photo is 640 px by 480 px, resolution 72 (per Photoshop's info). I threw the raw file at both Tesseract and ABBYY (both versions), and the output was either non-existent or unreadable; e.g., the Tesseract output was blank.

The same file with enlargement and resolution adjustments (see my earlier comment) did produce output.

@AzraelBrown I tried threshold, and because of the sepia tone levels, it did not produce clear black and white. However, I can provide the original file so you can see the situation.

The moral of the story is that being a renegade archivist/translator with shaky phone camera files is not recommended.
posted by jadepearl at 12:53 PM on February 13, 2022

As stated above, "72 dpi" is meaningless without knowing the size of the image.

For anything more than a few words, 640 px * 480 px seems quite small. You'd be better off starting again by taking higher quality source photos.
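The pixel math mentioned earlier in the thread can be written out as a few lines of Python. This is a rough sketch: `effective_dpi` is a made-up name, and the 8.5 x 11 inch page size is an assumption, since the real size of the book pages isn't given.

```python
def effective_dpi(pixel_dims, page_dims_inches):
    """Estimate real dots per inch from pixel size and physical page size.

    Long side is matched to long side, so photo orientation doesn't matter.
    The lower of the two figures is returned as the conservative estimate.
    """
    long_px, short_px = max(pixel_dims), min(pixel_dims)
    long_in, short_in = max(page_dims_inches), min(page_dims_inches)
    return min(long_px / long_in, short_px / short_in)


# A 640 x 480 photo of one 8.5 x 11 inch page:
print(round(effective_dpi((640, 480), (8.5, 11))))    # 56

# A file reporting "72 dpi, 35 inches wide" is 72 * 35 = 2520 pixels wide:
print(round(effective_dpi((2520, 3300), (8.5, 11))))  # 296
```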
posted by turkeyphant at 5:54 PM on February 13, 2022

Unfortunately, unless you're working with business card sized documents, if you're starting with 640x480 images no amount of photoshop can add the pixels you need to make these images usable. There is almost no way you are going to get successful results with these images.

My suggestions to get better images:

A letter-sized document at 300 dpi is about 8 megapixels; a letter-sized fax is about 2 megapixels and is the lowest threshold for text recognition. You need a camera which can support somewhere in between these levels of resolution. (For comparison, 640x480 is only one third of one megapixel for an 8.5x11 sheet.)
If the sepia paper and the text on it are too close in color for threshold to separate the two (and threshold usually has a setting from 0 to 255 to adjust, did you try everything in between?) you need a better light source -- not necessarily camera flash since that is going to be super bright and right against the page, but find a bright ambient light source so the camera can adjust to the light rather than struggling to grab a useful image. Most book scanning equipment I've used has two lights, one from the left and one from the right, to try and cancel out shadows.
If these are bound books, build yourself a little book holder shaped like a V that the book can lay in -- this will keep the pages from "humping" in the middle due to how they're bound into the spine and keep the lines of text straighter. You may also want a sheet of thick clear plastic to lay on the page to help hold it flat.
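The megapixel figures above can be checked with a short calculation. A minimal Python sketch, with hypothetical function names and a letter-sized (8.5 x 11 inch) page assumed throughout:

```python
import math


def megapixels_needed(page_width_in, page_height_in, dpi):
    """Megapixels required to capture a page of this size at the given dpi."""
    return (page_width_in * dpi) * (page_height_in * dpi) / 1_000_000


def dpi_from_megapixels(megapixels, page_width_in, page_height_in):
    """Approximate dpi if an image of this many megapixels exactly covers the page."""
    return math.sqrt(megapixels * 1_000_000 / (page_width_in * page_height_in))


# Letter page at 300 dpi: 2550 x 3300 pixels.
print(round(megapixels_needed(8.5, 11, 300), 1))        # 8.4

# A 2 megapixel image of a letter page, and a 640 x 480 one (0.3072 megapixels):
print(round(dpi_from_megapixels(2, 8.5, 11)))           # 146
print(round(dpi_from_megapixels(0.3072, 8.5, 11)))      # 57
```

The second function assumes the photo has the same proportions as the page and no wasted margin, so real photos will come out somewhat lower.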

posted by AzraelBrown at 12:42 PM on February 14, 2022 [1 favorite]
