Are there companies that are experts in digitizing pages of numbers?
August 28, 2017 1:14 PM

We have many physical volumes of public international census data we'd like to scan and OCR with high quality. Is there anyone out there occupying that niche?

I am at an organization that is exploring the possibility of scanning and digitizing a large volume of international public use data for the purposes of both preservation and making the data publicly available on the web. We have lots of in-house expertise on the munging of data tables into more digestible forms, but not on the scanning and OCR side of things. I want to find a company that has expertise in both archival quality scanning and OCR, preferably involving tables of numbers.

This is a representative example of what we're trying to digitize.

I can find miscellaneous OCR services online with Google. I'm asking the hive mind because of our particular requirements: archival quality and attention to detail, specifically in capturing tabular data.
posted by mcstayinskool to Technology (5 answers total) 1 user marked this as a favorite
The term you're looking for is "service bureau" or "BPO" -- there are companies that do this on very large scales and on small scales; I work for a company in North Dakota that does it on a small scale.

One caveat: verified, archival, guaranteed OCR data is going to be expensive. We sometimes outsource large projects, and even presumably proofread, human-entered data is... far from accurate. With hundreds or thousands of pieces of information per page, even an error rate of 0.01% still adds up to a lot of errors over the whole project.
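To put rough numbers on that, here is a minimal sketch of the arithmetic. The project size (10,000 pages, 500 values per page) is a made-up assumption for illustration, not a figure from this thread, and it treats errors as independent at a fixed rate, which real OCR or keying errors may not be.

```python
def expected_errors(pages, values_per_page, error_rate):
    """Expected count of wrong values, assuming independent errors at a fixed rate."""
    return pages * values_per_page * error_rate


def share_of_pages_with_error(values_per_page, error_rate):
    """Probability that a page contains at least one wrong value."""
    return 1 - (1 - error_rate) ** values_per_page


# Hypothetical project: 10,000 pages, 500 values per page, 0.01% error rate.
pages = 10_000
values_per_page = 500
error_rate = 0.0001  # 0.01% expressed as a fraction

print(round(expected_errors(pages, values_per_page, error_rate)))
print(share_of_pages_with_error(values_per_page, error_rate))
```

Under those assumptions the first figure comes to about 500 wrong values across the project, which is why it is worth asking a vendor how its accuracy is measured (per character, per field, or per page).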

Edit: oh, I see you're in Minneapolis; I used to be more familiar with our competitors in that area, but many have closed/relocated -- MeMail me if you want to discuss.
posted by AzraelBrown at 1:39 PM on August 28, 2017 [1 favorite]

This is part of the Internet Archive's mandate. However, you may want to reach out to your state archive as well. You should also be careful around any privacy and copyright issues. My country has a national census organization; perhaps yours does too, and it would be another resource to reach out to.
posted by saucysault at 1:48 PM on August 28, 2017

I used to work for IA and mostly like them, but I would not use them for something that requires a very high degree of OCR accuracy.
posted by jessamyn at 2:33 PM on August 28, 2017 [1 favorite]

I do that! Or I have, at several jobs. And as mentioned - having human-verified OCR quality gets really pricy. However, scan-and-machine OCR accuracy has gotten very good over the years; you just need to be able to arrange the scan settings properly.

If I were setting up a scan project like that, I'd want:

1) Ability to chop the bindings off the books so they can be fed through the scanner. Without that, flatbed or photo-scanning are both a lot more expensive and a lot slower.

2) Scan at 400-600 dpi, black & white, instead of the normal 300, because you have tiny print and you want the numbers read accurately. (Order a sample and have an imaging specialist confirm that the correct settings are being used, because they are very much NOT standard and the default scanner settings will be 200 or 300 dpi.) 300 dpi is mostly fine... but if you have 8 pt and smaller numbers, the OCR won't be as good.

3) If you want the files converted from OCR'd PDF to Excel or some other format that will do the tables - contract for that work separately; don't expect to hire the same company to do it. (They'll offer. But a company with experience with scan & doc delivery will very likely not have skill with data extraction & doc formatting, and the results are likely to not be what you want.) (Side note: Demand to talk to the techies/production staff before hiring anyone; the sales people know NOTHING about scanning specs.)
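A rough way to see why resolution matters for small type in point 2: a typographic point is 1/72 inch, so the pixel height of the type body is just point size / 72 x dpi. The sketch below is only that arithmetic; the function name is made up for illustration, and actual digits are somewhat shorter than the full body height, so treat the results as upper bounds.

```python
POINTS_PER_INCH = 72  # 1 typographic point = 1/72 inch


def type_height_px(point_size, dpi):
    """Approximate pixel height of the type body at a given scan resolution."""
    return point_size / POINTS_PER_INCH * dpi


# Compare 8 pt type at common scanner settings.
for dpi in (200, 300, 400, 600):
    print(f"{dpi} dpi: about {type_height_px(8, dpi):.0f} px tall")
```

At 300 dpi an 8 pt character body is only about 33 pixels tall; at 600 dpi it is about 67, giving the OCR engine roughly four times as many pixels per character to work with.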

Feel free to MeMail me for more details; this kind of thing really was a large part of my job for several years. Then scanner tech got better and the market evaporated - and now there's a growing number of companies that need those services, but very few that specialize that way.
posted by ErisLordFreedom at 2:35 PM on August 28, 2017 [1 favorite]

4) Confirm they have a double-sided ADF (automatic document feeder) scanner (or several, depending on the scope and timeline of the project) that has two scanning surfaces - and that they're not using a large print/scan/copy device that handles double-sided pages by feeding them through three times. (Once to scan the front, once to scan the back, and again to flip the page over into the "done" tray.) That's horribly slow, and it increases the chance that something will get jammed or torn.

Fujitsu ScanSnaps are terrific devices. Canon also makes some great ADF scanners. Basically, though, find out if they have dedicated scanners rather than "we can totally use our printer to scan!" If they're using print/copy/scan machines, make really sure they have people who understand how the scanning process works.
posted by ErisLordFreedom at 2:53 PM on August 28, 2017
