Poor man's EDMS (electronic document management system)
February 14, 2014 6:49 AM

I work for a small non-profit organization that is drowning in old paperwork, and we’d like to scan it into a searchable electronic archive. Our budget is limited.

I’m looking for an inexpensive, no-frills way to create such an electronic archive. We have a fairly fast, modern multi-function device that can scan the documents into PDF files and put them right on the file server. The challenge, though, is to implement some efficient system to retrieve any given document from the archive.

I was excited to find an open-source (free) program called Mayan EDMS. However, I had two different people test it independently, and both found it to be buggy and difficult to work with. The support forum wasn’t much help. In any case, Mayan is probably overkill for our needs. I’m not worried about creating different users, with different levels of security permissions. I’m also not overly concerned about seeing little thumbnails of the documents.

To provide some additional background, our most important document trove consists of roughly 20,000 packets of about 10-20 pages each, some of which have hand-written notes on them. Each packet has these critical pieces of information that would have to be searchable: Two ID numbers, and a person’s name & contact information. It wouldn’t be too difficult for me to create an automated way to print a cover sheet containing that information in a fixed location on the page. The cover sheet could be added as the first page of the packet. What the EDMS would have to perform is some kind of reasonably reliable OCR (optical character recognition) to capture that data and associate it with the PDF file. Then, there would have to be a way to search the archive for those pieces of information.

I solicited bids from two IT consulting companies, but their prices were outrageously high. I have access to a few very tech-savvy volunteers who could devote some time to this project, so I’m not necessarily looking for a pre-packaged solution that would work right out of the box. I do have a little bit of money to spend on the project (around $1,000), and I have an unused, modern server that’s already available to me. I envision the data as being hosted on-site rather than in the cloud.

Any thoughts on this?
posted by alex1965 to Computers & Internet (14 answers total) 17 users marked this as a favorite
A Premium subscription to Evernote should still (I think) do auto-OCR on PDFs, which you could then search.
posted by wenestvedt at 6:56 AM on February 14, 2014 [2 favorites]

You prefer non-cloud solutions, so Evernote is out. If you have a Mac, buy DEVONthink Pro Office. If you don't have a Mac, I would use your budget to get one, even a refurbished one, then install DEVONthink Pro Office. That way you'll have scanner/OCR support and a superlative indexing system (and that's an understatement). It will solve your problem.
posted by PickeringPete at 7:29 AM on February 14, 2014 [1 favorite]

Response by poster: I'm not totally averse to cloud-based solutions -- just thought that local hosting would be simpler and would avoid potential headaches (like our limited upstream bandwidth, recurring monthly service fees, etc.)
posted by alex1965 at 7:38 AM on February 14, 2014

I would actually go lower tech here. Just scan in the documents as individual files, number the files using a Bates system or somesuch, and create a spreadsheet index with the key data you need to track correlated to your index number, (ID numbers, name, address, etc). Save your money for fancy document management software and spend some of it instead on a good backup system for your data.
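A minimal sketch of what such an index could look like, as a CSV maintained alongside the scanned files (the column names and file layout here are made up for illustration, not a required format):

```python
import csv
from pathlib import Path

INDEX = Path("archive_index.csv")
FIELDS = ["bates_no", "id_1", "id_2", "name", "contact", "filename"]

def add_entry(entry: dict) -> None:
    """Append one scanned packet's key data to the index CSV,
    writing the header row first if the file doesn't exist yet."""
    new_file = not INDEX.exists()
    with INDEX.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(entry)

# Example: record one packet after scanning it
add_entry({
    "bates_no": "000001",
    "id_1": "A-1234",
    "id_2": "99-887",
    "name": "Jane Doe",
    "contact": "jane@example.org",
    "filename": "000001.pdf",
})
```

Any spreadsheet program can open and search the result, which keeps the retrieval side zero-cost.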
posted by yarly at 7:50 AM on February 14, 2014

Oh, and by way of reference: I just processed about 300 documents in the way I'm suggesting, and the data entry and scanning took me about 5 hours. So that means 20,000 might take around 300 hours of labor.
posted by yarly at 7:56 AM on February 14, 2014

Response by poster: yarly, I was looking for something automated (or semi-automated). Your system would certainly work but would be fairly labor-intensive. New packets are created at the rate of ten to twenty per day.
posted by alex1965 at 7:57 AM on February 14, 2014

I hope you find an automated solution here, because I would love it too for my nonprofit! My experience is that once you have a semi-automated indexing system set up along the lines I outlined, it's not too labor-intensive to maintain after you clear the initial backlog. I do hope that someone else has a more brilliant solution for us, though.
posted by yarly at 8:05 AM on February 14, 2014

I use a system called Cabinet in a small office. There is an annual fee, which includes support, based on how many workstations need to have the program open at one time. Since nobody here needs to access files on an ongoing basis, we pay for a single annual license: if someone needs a document, they open Cabinet, find what they need, and log off. If that model would work for you, the software and support are excellent and the single annual fee is affordable.
posted by elf27 at 11:05 AM on February 14, 2014 [1 favorite]

You can use a Neat scanner; a $120 annual subscription should cover most of your electronic document archiving needs.
posted by radsqd at 11:10 AM on February 14, 2014 [1 favorite]

If you are looking for a scaled-up version, this might be more your flavor. ABBYY is well known in the field of OCR, and this seems more of a library-plus-OCR-on-your-own-server solution.
posted by radsqd at 11:17 AM on February 14, 2014 [1 favorite]

I design systems like this as my day job, basically. (Among other things.) I doubt that you are going to be able to afford a soup-to-nuts solution that you can just stand up and "just works" without compromising heavily on your requirements. (If you could, I'd be rapidly out of a job, and I am not yet out of a job.)

So you probably need to break up the problem into a couple of smaller, more manageable ones, and tackle them independently. First, document ingestion and indexing:

You're creating 10-20 packets per day, 10-20 pages per packet, so that's anywhere from 100 to 400 pages per day. It's not clear whether you want each "packet" to be a logical document within the system, or if there's an intermediate level in the hierarchy (i.e., is it pages → documents → packets?). That will influence whether you can just scan in a packet as a single document or whether you need to separate the documents within the packets and then relate them using an index field. The latter is more complex but might be more useful.

Then in addition to that, you have a 20,000-packet backfile that needs to be scanned in, which could be as much as 400,000 pages. Assuming you want to clear that backfile in less than a year, you'll need a scanner that can handle 1000+ pages per day on top of whatever it's being used for right now. Just something to keep in mind. A good workgroup scanner can do that, but it might need to be dedicated to the task.

Your idea of using index/separator sheets on top of each packet is right on. Although, if you are keying in index values for each packet in order to print them in an OCR-able form, you should really be putting them into a database (or at least a spreadsheet or something) at the same time. Then you could relate the data in the spreadsheet/DB with the paper documents via a barcode (much easier to read by machine than text!) that gets printed on the index sheet. So the scan program only needs to read the barcode, not actually do any real OCR.

It's really dangerous to OCR in key fields like account numbers that are going to be your primary retrieval criteria. E.g., if you print out index sheets and then OCR them, and a flyspeck turns a 0 into an 8 or Puttle into Tuttle, there's the possibility that a packet might basically just "disappear" or not exist from a user's perspective. So if that information is being keyed in somewhere anyway, it's best to avoid printing it out and then scanning it back in. OCR should only be used to obtain information that can't be gotten any other way.

If you are willing to DIY stuff, here's what I'd do:
  • Create (or have someone create) a simple database-backed, CRUD web app for doing your indexing and for generating the index sheets. Basically it's a page where you put in the key index fields for a packet, and it sticks those values into a database and associates them with a generated doc ID number, which it puts onto a PDF as a barcode and in human-readable form, and then hands you back the PDF. This avoids OCRing of critical index data. You put the sheet on top of the packet and it goes to the scanner. You do this process as part of the doc preparation, which includes taking all the staples out, getting rid of Post-Its, etc.
  • Use your scanner's capture software to scan in the docs. Kodak Capture, which comes with the Kodak workgroup-level scanners, will do document separator sheets and read barcodes. I think Fujitsu's will too, but I'm not sure. You'll need to have a scanner that's actually connected to a workstation for this, not a shared MFP machine. If you only have a consumer scanner ... you'll need to get a new scanner, or at least better software that supports the scanner. It's probably cheaper to buy a scanner and get the software free than to buy an "enterprise" capture package. Have the scan package name each file based on its barcode number, or something else simple. I'd do this periodically, like once a week or so, when you have a pile of them.
  • Now you have scanned documents with the barcode number associated, and you have all your index numbers in a database, also with the barcode. From here you want to stick the documents into your DMS. Most big commercial DMSs have ingestion utilities that will do the job; most of the free ones I've played with ... don't. N.B.: At this point you basically have yarly's "lower tech" solution: you can go into the database and perform a query, getting back a document ID, and then find that document on the filesystem and open it up. You could get to this point, or work towards it, right away, and then put the DMS piece on top later.
  • Assuming your chosen DMS doesn't have its own bulk ingestion tool, I'd write a little utility (probably run periodically, i.e. as a cron job) that monitors a shared folder and looks for documents, and when it finds them looks for matching IDs in the database, and if it finds a match, sticks the document and the index data into the DMS. Actually writing this would depend on the DMS and its API. Some systems support simple REST APIs, others might require knowledge of Java; your level of comfort might drive your selection of a DMS.
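As a concrete sketch of the database side of the first bullet, here's roughly what the indexing app's core would do, using SQLite (the table name and fields are made up for illustration; a real app would wrap this in a web form and add the barcode/PDF cover-sheet generation on top):

```python
import sqlite3

conn = sqlite3.connect("index.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS packets (
        doc_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        id_1     TEXT NOT NULL,
        id_2     TEXT NOT NULL,
        name     TEXT NOT NULL,
        contact  TEXT
    )
""")

def register_packet(id_1: str, id_2: str, name: str, contact: str) -> str:
    """Store the keyed-in index fields and return the generated doc ID
    that would be printed on the cover sheet as a barcode."""
    cur = conn.execute(
        "INSERT INTO packets (id_1, id_2, name, contact) VALUES (?, ?, ?, ?)",
        (id_1, id_2, name, contact),
    )
    conn.commit()
    return f"{cur.lastrowid:08d}"  # zero-padded doc ID for the barcode

def find_doc(search: str):
    """Retrieval side: look up doc IDs by either ID number or name."""
    return conn.execute(
        "SELECT doc_id FROM packets WHERE id_1 = ? OR id_2 = ? OR name = ?",
        (search, search, search),
    ).fetchall()

# Example: key in one packet and get back its doc ID for the cover sheet
doc_id = register_packet("A-1234", "99-887", "Jane Doe", "jane@example.org")
```

The point is that the authoritative index data only ever lives in the database; the barcode on paper is just a pointer back to it, so nothing critical has to survive a round trip through OCR.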
It used to be possible to get scanners that had a Bates-type stamp on them, and would physically stamp a document ID onto the paper before it ran through the scanner, while also encoding that number onto the scanned image as metadata. That would let you skip the coversheet-creation step and index after scanning rather than before. But I haven't seen a scanner (even high end models) like that in years. If you could find an old one around for cheap and get it working, it's sort of neat though.

What you are trying to do isn't trivial, and any way you cut it there's going to need to be a significant investment, either of labor or money, into this system. Given the amount of your backlog, I would look hard for a way to separate ingestion from the DMS and avoid anything that locks up your documents inside somebody's DMS in a way that makes them difficult to get out. The time involved in scanning them is going to be huge and you don't ever want to have to do that again once you do it once.
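For the folder-monitoring utility, a bare-bones sketch of the shape it might take (the directory names and the `packets` index table are assumptions, and the final `shutil.move` stands in for a call to whatever ingestion API your chosen DMS actually exposes):

```python
import shutil
import sqlite3
from pathlib import Path

SCAN_DIR = Path("scans")     # where the capture software drops PDFs
INGESTED = Path("ingested")  # stand-in for the real DMS hand-off

def ingest_pending(db_path: str = "index.db") -> list:
    """Run periodically (e.g. from cron). For each scanned PDF whose
    filename is a doc ID, confirm a matching index record exists, then
    hand it off; anything unmatched is left behind for a human."""
    conn = sqlite3.connect(db_path)
    INGESTED.mkdir(exist_ok=True)
    moved = []
    for pdf in sorted(SCAN_DIR.glob("*.pdf")):
        if not pdf.stem.isdigit():
            continue  # not barcode-named; leave for manual review
        row = conn.execute(
            "SELECT 1 FROM packets WHERE doc_id = ?", (int(pdf.stem),)
        ).fetchone()
        if row:
            shutil.move(str(pdf), str(INGESTED / pdf.name))
            moved.append(pdf.stem)
    return moved
```

Leaving unmatched files in place, rather than failing or guessing, is the safety valve: a mis-scanned barcode becomes a visible pile to fix rather than a silently lost packet.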
posted by Kadin2048 at 12:53 PM on February 14, 2014 [8 favorites]

You have not identified your platform - Mac OS X, Windows, Linux - and that makes an intelligent answer difficult.

Under Mac OS X, another excellent option for dealing with the files already on the system is a small app called Leap. It won't do OCR, but it will allow for quick review, tagging, and a short description for each file. Doing this for a small collection of the key files can be very worthwhile. The key is to start a scanning system for now and the future, and take the time you need for the legacy documents. After a while, the old documents lose their importance.
posted by yclipse at 1:58 PM on February 14, 2014

Alfresco? Using it for DMS is kludgy but doable.
posted by snuffleupagus at 8:14 AM on February 15, 2014

Response by poster: Thanks for all the feedback. Some very useful leads here! I'm going to mark the question as "resolved", even though I haven't actually had a chance to investigate the tips I've received. Thank you to everyone who offered advice.
posted by alex1965 at 10:55 AM on February 16, 2014
