Solution to OCR many bank statements into Excel
September 29, 2012 7:32 PM

Need to get data from hundreds of pages of bank records into a spreadsheet. We have a scanner with a document feeder, but would love some recommendations on software/workflow ideas.

I've searched previous questions and found some info about Abbyy FineReader, but I'm not sure whether it's still the state of the art. I'm new at this and would really appreciate suggestions for a good workflow for getting lots of bank statements into a spreadsheet to analyze. I know the best option would be to download them from the bank in an electronic format, but unfortunately that might not be possible.

I'm using Mac OS X 10.7.4, but I also have a Windows 7 machine that I can use if there are better solutions available for that platform.
posted by capnsue to Technology (6 answers total)
I don't have an answer, just an observation. Scanning and OCRing from paper into spreadsheets is a very difficult task. Even with the best OCR program (and FineReader is, in my experience, the best), it is still very challenging. A lot of post-scanning correction by human eyes will be needed.
posted by megatherium at 8:16 PM on September 29, 2012

The challenge of your task is going to be telling the software what to capture. (Unless it's your intention to capture everything - addresses, promotional messages, logos, etc.)

Are the statements in a similar enough format that you could set up a template for the OCR software to follow? If you can set up a template, this becomes a much easier task to automate. Otherwise, you'll be doing quite a bit of post-scan data remediation.
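
To make the template idea concrete, here is a minimal sketch in Python. It assumes your OCR software can report each recognized word along with its pixel position on the page (many can, but check yours). The `Region` class, the `TEMPLATE` coordinates, and the `capture` function are all hypothetical names and made-up numbers for illustration, not part of any particular OCR product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangular capture area on the scanned page, in pixels."""
    name: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


# Hypothetical template for one statement layout; the coordinates are made up
# and would have to be measured from a real scan.
TEMPLATE = [
    Region("period", 1200, 150, 2300, 250),
    Region("transactions", 150, 900, 2300, 2900),
]


def capture(words, template=TEMPLATE):
    """Group OCR'd words by the template region that contains them.

    `words` is an iterable of (text, x, y) tuples, where (x, y) is the
    top-left corner of the word. Words outside every region (logos,
    addresses, promotional messages) are dropped.
    """
    captured = {region.name: [] for region in template}
    for text, x, y in words:
        for region in template:
            if region.contains(x, y):
                captured[region.name].append(text)
                break
    return captured
```

You would need one template per statement layout, so this only pays off if most of the pages share a format.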
posted by 26.2 at 10:13 PM on September 29, 2012

I've previously mentioned Abbyy Finereader here, but I agree with the able comments that OCRing accurately to a spreadsheet is a very difficult task. And, with financial records, you want accuracy.

I can't remember if Abbyy Finereader has a trial version, but if it does, I would download the trial version and test it out for accuracy.
posted by dfriedman at 2:01 AM on September 30, 2012

Should read "above comments", not "able comments"...
posted by dfriedman at 2:01 AM on September 30, 2012

One approach you may wish to consider is outsourcing. You can scan the docs yourself, obscure personally identifiable information (e.g. by pasting a blackout template over every statement image, hiding the name, address, and account number headings), and then go to one of the many job-bidding websites to get the actual transcription done.
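
The blackout step could be scripted rather than done by hand. Below is a rough sketch that assumes ImageMagick is installed and that the sensitive fields sit in the same place on every page. The box coordinates and the function names are hypothetical; measure the real positions from one of your scans and check a few output images before sending anything out.

```python
import subprocess

# Hypothetical blackout boxes as (left, top, right, bottom) in pixels.
BLACKOUT_BOXES = [
    (150, 300, 1100, 600),   # name and mailing address
    (1500, 150, 2300, 250),  # account number
]


def redact_command(src, dst, boxes=BLACKOUT_BOXES):
    """Build an ImageMagick command that paints each box solid black.

    Uses the legacy `convert` entry point; ImageMagick 7 installs also
    provide `magick`.
    """
    cmd = ["convert", src, "-fill", "black"]
    for left, top, right, bottom in boxes:
        cmd += ["-draw", f"rectangle {left},{top} {right},{bottom}"]
    cmd.append(dst)
    return cmd


def redact(src, dst, boxes=BLACKOUT_BOXES):
    """Run the redaction; raises if ImageMagick is missing or fails."""
    subprocess.run(redact_command(src, dst, boxes), check=True)
```

Write the redacted copy to a new file, as above, so the original scans stay intact for your own records.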
posted by Dimpy at 11:14 AM on September 30, 2012

I don't have the know-how, but maybe this could be done with something like ImageMagick?

I'm thinking of a workflow like this:

1. Scan the document to an image file.
2. Use ImageMagick to extract certain regions of each page, given that the statements will be printed on a uniform bank template (at least on a per-account basis).
3. Hand off those snippets to an OCR engine.
4. Insert the results into a database/spreadsheet.
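
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: Tesseract stands in as the OCR engine (the workflow doesn't name one), the crop geometry is made up, and each transaction line is assumed to read "MM/DD/YYYY description amount". All function and constant names are hypothetical.

```python
import csv
import re
import subprocess

# Hypothetical region holding the transaction table: WIDTHxHEIGHT+X+Y in pixels.
TABLE_GEOMETRY = "2150x2000+150+900"

# Assumed line layout: MM/DD/YYYY, free-text description, signed amount.
LINE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")


def crop_command(page, snippet, geometry=TABLE_GEOMETRY):
    # ImageMagick: -crop cuts out the region, +repage resets the canvas offset.
    return ["convert", page, "-crop", geometry, "+repage", snippet]


def ocr_command(snippet):
    # Tesseract CLI: an output base of "stdout" prints the recognized text.
    return ["tesseract", snippet, "stdout"]


def parse_lines(text):
    """Split OCR text into parsed rows and lines that need a human look."""
    rows, rejects = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE.match(line)
        if match:
            date, description, amount = match.groups()
            rows.append((date, description, amount.replace(",", "")))
        else:
            rejects.append(line)
    return rows, rejects


def process_page(page, snippet, out_csv):
    """Crop one scanned page, OCR the snippet, append parsed rows to a CSV."""
    subprocess.run(crop_command(page, snippet), check=True)
    result = subprocess.run(
        ocr_command(snippet), check=True, capture_output=True, text=True
    )
    rows, rejects = parse_lines(result.stdout)
    with open(out_csv, "a", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rejects  # lines that did not match the assumed layout
```

Lines that don't match the expected pattern are returned instead of being silently dropped, so they can be corrected by hand. The resulting CSV opens directly in Excel.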
posted by snuffleupagus at 11:28 AM on October 7, 2012
