small-scale document digitization services? document retention policies?
September 6, 2016 8:16 PM

I work for a small non-profit that has had six linear feet of historical records on paper in somebody's basement for the last few decades. Can the hive mind (1) recommend a company that can scan/OCR these papers for us and (2) recommend resources for teaching me how to design and implement a document retention policy?

Ideally we'd have the document retention policy first and execute it on these papers, but I suspect designing the policy is going to be a long, frustrating process. So I'd like to get these papers digitized first, in a way that keeps our options open.

So, in the short term, can anyone recommend a company that will scan and OCR six linear feet of papers for us, preferably cheaply? They're almost all 8.5x11", printed on both sides mostly with black and white text.

In the long term, I'd like to design and implement a document retention policy. I have not the faintest idea how to go about this. The ever-bountiful internet has advice for many different people with many different needs, and almost all of them have stronger requirements and/or bigger budgets than we do. Any advice, even a little bit of link curation, would be helpful.
posted by d. z. wang to Grab Bag (4 answers total) 3 users marked this as a favorite
 
Six linear feet isn't very much. You might consider just buying a Fujitsu ScanSnap and doing the digitization yourself. That way you can keep things named and organized in a way that makes sense to you, which also helps with future pruning. It wouldn't take long to do, and you could catch up on that TV show you have been meaning to watch. It also makes future digitization fast and easy.
posted by rockindata at 8:38 PM on September 6, 2016 [2 favorites]


FedEx Office (Kinko's) does this.
But I'd rather suggest purchasing your own document scanner. The Fujitsu above and the Canon DR-C225, which I prefer, are both good.
posted by artdrectr at 10:13 PM on September 6, 2016


Disclosure: what you're describing is what I do for a living.

First, review this related answer I've given in the past. Summary: look for a "service bureau" or BPO service; scanning and OCRing is what they do.

Our service bureau would definitely do six linear feet for you, probably at about $0.20-$0.25 per page with unproofread OCR. Take that as a baseline for pricing, not an offer of services (I'm not a salesguy, that's just a guess).

If six linear feet constitutes years of document storage, your "document retention policy" can be as simple or as complicated as you make it; you just have to make people follow it. Example:
  • All documents must be scanned and OCR'ed within one week of receiving;
  • Physical copies of documents are stored for one month and then destroyed;
  • Digital copies are stored for five years and then destroyed.
Now, actually DOING this is the important part: train your people, get good equipment, and have a good way of organizing your digital copies.
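To make the example policy above concrete, here is a minimal sketch that turns the three rules into dates for a single document. It assumes all three clocks start on the day the document is received (the example does not say), and the function and key names are made up for illustration.

```python
import calendar
from datetime import date, timedelta

def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the last day of the month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

def retention_schedule(received: date) -> dict:
    """Apply the example policy: scan in a week, shred paper in a month, delete in five years."""
    return {
        "scan_by": received + timedelta(weeks=1),
        "destroy_paper_on": add_months(received, 1),
        "destroy_digital_on": add_months(received, 5 * 12),
    }

# Print the three deadlines for a document received on a sample date.
for step, due in retention_schedule(date(2016, 9, 6)).items():
    print(step, due.isoformat())
```

A spreadsheet with the same three columns would do the job just as well; the point is that each rule becomes a date somebody can be reminded about.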

The biggest question I have is: why do you want them OCR'ed? What do you want to do with that OCR data?

Organizing digital documents generally involves a "document management system". There's lots of software out there; we're an EMC shop, so ApplicationXtender is what we use, but it's probably out of your price range since it's licensed per seat and per server, and isn't really meant for "one PC to look up a small number of documents".

But, I digress: the key to any document management system is finding things. If you can't find the documents you're looking for, digitizing them made no sense. If you're planning on OCR'ing everything with the hope you can just do a search for any word within the document: this can be done, and we have customers who work this way, but this route is fraught with peril. OCR is far from perfect, and even if it tells you which document it found the word in, how do you know that document is relevant to what you're searching for? You generally need 'indexes', a meta-data structure of specific pieces of data, so when you see a document listing in the computer, you know what it is.

So, if you buy an off-the-shelf consumer-level scanner, with software that will create a PDF with the OCR'ed text in it, and then name every file like "letter-from-john-smith-2016-08-23.pdf", you can search within Windows for any word in the document, and the filename will tell you what document you found. This is the most basic and simple document management system you could in theory use, and we have customers that do it this way (one of the ones that brings us 4 linear feet a year to scan for them).
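A naming scheme like that is easier to keep consistent if the names are generated rather than typed. The sketch below is one way to do it, assuming a free-text description plus the document's date; `make_filename` is a hypothetical helper, not part of any scanner software.

```python
import re
from datetime import date

def make_filename(description: str, doc_date: date) -> str:
    """Build a name such as 'letter-from-john-smith-2016-08-23.pdf'."""
    # Lowercase the description and collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{slug}-{doc_date.isoformat()}.pdf"

print(make_filename("Letter from John Smith", date(2016, 8, 23)))
# prints letter-from-john-smith-2016-08-23.pdf
```

Putting the date last in year-month-day order also means that files with the same description sort chronologically.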

But: this system is only as good as:
  • The OCR results;
  • The person naming the files.
OCR results are only as good as the original document and the OCR software; if you want them proofread, that gets spendy; usually OCR results are delivered exactly how the software identified the text, because paying a person to read every page and fix issues is prohibitively expensive. So, if you've got older typewritten documents, you might find that "Jensen" becomes "Jonson" after OCRing because of light center lines in the 'e's. Now, your search results for "Jensen" become incomplete. It's best not to rely purely on OCR, which is what one of our customers is finding out because they can't ever find what they're looking for.
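The "Jensen" versus "Jonson" problem can be shown in a few lines. This sketch uses approximate matching from Python's standard `difflib` module, which is a technique swapped in here for illustration and not what any particular document management package does. It shows an exact search missing the misread name and a fuzzy search finding it; a looser match can also return false hits, so it softens the problem without removing it.

```python
import difflib
import re

def find_word(query: str, ocr_text: str, cutoff: float = 0.6) -> list:
    """Return words from the OCR text that are close to the query, best match first."""
    words = sorted(set(re.findall(r"[a-z]+", ocr_text.lower())))
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)

# Hypothetical OCR output where the typewritten 'e's were read as 'o's.
ocr_text = "Dear Mr. Jonson, thank you for your letter."

print("jensen" in ocr_text.lower())   # exact search: False, the document is missed
print(find_word("Jensen", ocr_text))  # fuzzy search: finds the misread spelling
```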

If your secretary, or whoever is assigned to do this, suddenly forgets they need to put the date on the filenames, now you've got a problem. Or, if they start shorthanding, thinking they're being efficient, and now the filenames are incomprehensible, it's no longer a useful system. The key is "what pieces of information do I need to know what document I want to open, so that the one I pick is certain to be the one I want". If you have to open twenty PDFs to find the right one, you're doing it wrong.
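One cheap guard against dates going missing is to check names against the convention now and then. A minimal sketch, assuming the "words-then-date" pattern from the example filename above; the pattern and the function name are illustrative, so adjust them to whatever convention you settle on.

```python
import re
from datetime import datetime

# Lowercase words separated by hyphens, then a YYYY-MM-DD date, then ".pdf".
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*-(\d{4}-\d{2}-\d{2})\.pdf$")

def follows_convention(filename: str) -> bool:
    """True if the name ends in a real calendar date and has a description before it."""
    match = NAME_PATTERN.match(filename)
    if not match:
        return False
    try:
        datetime.strptime(match.group(1), "%Y-%m-%d")
    except ValueError:
        return False  # looks like a date but is not one, e.g. month 13
    return True

for name in ["letter-from-john-smith-2016-08-23.pdf", "ltr-js.pdf", "minutes-2016-13-45.pdf"]:
    print(name, follows_convention(name))
```

A check like this catches a missing or impossible date, but it cannot tell whether "ltr-js" means anything to the next person, so the shorthand problem still needs a human rule.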

A formal document management software package will probably have a higher quality OCR program than ABBYY, and will force you to create a structure of required, data-masked, finely tuned metadata fields for the scanner operator to fill in with each document they scan.

Scanners: we're also a Canon shop, so I recommend the DR-M160II -- it is fast, scans both sides in one pass, and produces the image quality that OCR needs. Two-sided scanning and speed are the biggest things to think about: if you buy a $50 flatbed from OfficeMax, time how long it takes to scan one document, then multiply that by the number of documents you've got. You don't want this to be a two-year project.
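The "time one, then multiply" check is simple enough to write down. In this sketch the sheet count and the per-sheet timings are placeholders for illustration only, not measurements of any scanner; substitute what you actually time and count.

```python
def estimate_scan_hours(sheet_count: int, seconds_per_sheet: float) -> float:
    """Hours of scanning: the measured time for one sheet times the number of sheets."""
    return sheet_count * seconds_per_sheet / 3600

# Placeholder inputs: replace with your own stack size and stopwatch readings.
sheets = 10_000
timings = {
    "flatbed, flipping each sheet by hand": 60,
    "sheet-fed, both sides in one pass": 2,
}
for label, seconds in timings.items():
    print(f"{label}: about {estimate_scan_hours(sheets, seconds):.1f} hours")
```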

If you're going to hire a service bureau to do your archive scanning, buying a scanner is for future incoming documents. The one thing in this answer you'll need to know before going to a service bureau is this part: "what pieces of information do I need to know what document I want to open, so that the one I pick is certain to be the one I want". Otherwise, what you get back from the service bureau will be exactly what you asked for, and if you still can't find a particular document when you need it, then sending the documents off for scanning was a wasted project.
posted by AzraelBrown at 5:18 AM on September 7, 2016 [4 favorites]


To build on what AzraelBrown said, from someone who used to do this professionally: the scanning is the easy part, it's the indexing that's hard. Running documents through a scanner is literally mindless work; humans get paid to do it only because building machines to unstaple and flatten out reams of paper is too hard. But indexing — giving those documents a name, or otherwise recording metadata about each document as it's scanned, and also breaking a series of sheets into logical "documents" — that is hard. Depending on the type of documents, it may actually require some level of subject-matter expertise, so the person doing the work either needs to be familiar with the documents or they need to be trained, or have very specific rules.

I would work backwards from the end-state. How do you want to search through the repository and find a particular document in the future? Would you be happy to navigate through nested folders by date? (E.g. folder by year, subfolders by month...) Do you need more granularity below that? If that's what you want in terms of a schema, do all the documents have dates on them?
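If folders by year and month turn out to be the schema, the mapping from a document to its folder is mechanical. A minimal sketch under that assumption, with an "undated" folder as one possible answer to the question of documents that carry no date; the folder names and the `archive_path` helper are illustrative.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Optional

def archive_path(root: str, doc_date: Optional[date], filename: str) -> PurePosixPath:
    """Return root/YYYY/MM/filename, or root/undated/filename when there is no date."""
    if doc_date is None:
        return PurePosixPath(root) / "undated" / filename
    return PurePosixPath(root) / f"{doc_date.year:04d}" / f"{doc_date.month:02d}" / filename

print(archive_path("archive", date(2016, 8, 23), "letter-from-john-smith.pdf"))
# prints archive/2016/08/letter-from-john-smith.pdf
print(archive_path("archive", None, "handwritten-note.pdf"))
# prints archive/undated/handwritten-note.pdf
```

If the "undated" folder ends up holding a large share of the archive, that is a sign that date alone is not enough of an index for these records.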

You can do the indexing and the scanning separately, or at the same time. Most good scan software will let you do the indexing as you scan, breaking pages up into logical documents and then populating the index fields into the filename, PDF metadata, or into sidecar index files depending on how you want it. (I am only really familiar with Kofax Capture, which is expensive, and the Kodak Capture software, which is free with their scanners.) If you want to do indexing later on, then generally you just have the scanner capture to a bunch of page images, and then you do indexing using a separate program that assembles them back together -- this generally means more expensive software, unless your only "indexing" is renaming the files, in which case you can do it in Windows Explorer or the Mac Finder.
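For anyone who hasn't seen one, a "sidecar index file" is just a small data file sitting next to the scan with the same base name. The sketch below writes and reads one as JSON; the field names are hypothetical, and this is not the format Kofax Capture or the Kodak software produces, only an illustration of the idea.

```python
import json
import tempfile
from pathlib import Path

def write_sidecar(pdf_path: Path, fields: dict) -> Path:
    """Save index fields as '<same name>.json' beside the PDF and return that path."""
    sidecar = pdf_path.with_suffix(".json")
    sidecar.write_text(json.dumps(fields, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar

def read_sidecar(pdf_path: Path) -> dict:
    """Load the index fields stored beside the PDF."""
    return json.loads(pdf_path.with_suffix(".json").read_text(encoding="utf-8"))

# Demo in a throwaway folder; the PDF here is an empty stand-in for a real scan.
with tempfile.TemporaryDirectory() as folder:
    pdf = Path(folder) / "letter-from-john-smith-2016-08-23.pdf"
    pdf.write_bytes(b"%PDF-1.4\n")
    write_sidecar(pdf, {"type": "letter", "from": "John Smith", "date": "2016-08-23", "pages": 2})
    print(read_sidecar(pdf))
```

The advantage over cramming everything into the filename is that fields can be added later without renaming anything; the cost is that the PDF and its sidecar have to be moved together.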

As for the document retention policy design, you need to think about what your goals are. When I've done doc-retention stuff, it was for big corporations who were mostly interested in reducing their legal exposure. Basically, they wanted to destroy everything just as soon as they were legally allowed to, only keep records around for the minimum amount of time required, etc. The perfect organization in this view would be amnesiac, knowing nothing, retaining nothing, discoverable by nobody. This is a very "scorched earth" approach to records management, generally pushed top-down from Legal or Compliance, and you should expect -- if that's your attitude -- to basically ride roughshod over everyone else in your organization like Genghis Khan with a paper shredder in order to enforce it. You will likely be hated, and people will hide things from you whenever possible, because these policies necessarily involve destroying lots of content that still has value to someone (presumably why they're retaining it) in order to reduce risk to the organization as a whole. YMMV.

Better (IMO) retention policies take into account the actual value of retained documents to everyone in the organization, not just risk reduction. You'll need to go around and talk to people about how long they think things should be kept for (which may be "forever and a day") and also whether there's risk involved in having those documents around. But my experience is that you'll only get people to use a document management system without bloodshed if there's a positive aspect to it: it has to be better -- safer, easier to search, whatever -- than just keeping the paper in a box (which is a pretty good storage mechanism, honestly) or else people won't bother.
posted by Kadin2048 at 1:11 PM on September 8, 2016 [2 favorites]

