Where do I start with document scanning?
April 15, 2010 8:54 AM

I have a couple of filing cabinets at home full of paper wherein the informational content may come in handy, one day, but the paper itself is irrelevant. I would like to scan in this paper and then dispose of the paper after the digital documents have been appropriately backed up. Some pieces are double-sided, some are not. Some pieces of paper are logically connected to other pieces of paper. Most are letter-sized, some are not.

After reading a few AskMes, I have come to realize that I do not even know the right questions to ask. Here are some of the questions I have put together:

1) Do I want scans to end up as PDF or multi-page TIFF? Could I get both at the same time?

2) Do I rely on my elaborate directory structure (which is pretty good so far), or do I want OCR as an additional help when looking? Would that end up as some set of keywords in the metadata of the PDF? As a separate Rich Text Format file? (I assume that if I do, ABBYY FineReader is the software of choice.)

3) What ought I to look for in an Automatic Document Feeder?

4) Will I be sad if I choose the ScanSnap only to find there is no TWAIN interface?

5) What are warranties like on these devices? Do I buy one and scan like mad until the warranty expires and the magic smoke emerges?

6) What are the gotchas of document scanning?

7) What else have I failed to consider?

The target machine is Windows XP and will not be connected to the Internet. I am disinclined to consider any cloud/online storage. I understand many people like that sort of thing; I will not be going in that direction. I am not looking for product recommendations so much as ideas on what sorts of things I ought to investigate or of which I could be wary.
posted by adipocere to Computers & Internet (14 answers total) 14 users marked this as a favorite
 
Best answer: Check your mail.
posted by shew at 9:28 AM on April 15, 2010


Best answer: I highly suggest checking out Evernote. I've found their ability to search PDFs and images very useful in just the 4 months I've been using it. I'm slowly trying to make our entire apartment paper-free moving forward.
posted by MCMikeNamara at 9:40 AM on April 15, 2010


Best answer: One version of Adobe Acrobat has a nice image+OCR format that lets you search and copy the text while keeping the formatting exactly as displayed, instead of trying to re-create the formatting and guessing at line spacing and whatnot. I'm not sure of the price or version, as it came with my home computer. I'll check when I'm back home.

As for multi-page TIFF: you would have to generate a text document alongside the TIFF if you wanted to search the document text, and not all image viewers support TIFF. You can find free image viewers that support multi-page TIFF, if this is the route you take.
posted by filthy light thief at 9:44 AM on April 15, 2010


Best answer: I noticed a good scanner on Cool Tools recently. Ties in with Evernote well. Might be a good start.
posted by Static Vagabond at 9:46 AM on April 15, 2010


Best answer: Depending on volume, it might be worth your while to think about outsourcing this. I'm currently (as in, this is what I've been doing all day today) looking into vendors and getting prices generally on the order of $0.07/page, for volumes of > 10,000 pages.
posted by dmd at 10:13 AM on April 15, 2010


Response by poster: I should add that my "dream solution" ends up with each scanned document resulting in a PDF with some embedded text in a fashion that filthy light thief suggested, an RTF, and a multi-page TIFF. Portability, searchability, and archival replication.

One Directory Structure to rule them all, One File Format to find them,
One Software to scan them all and in NTFS bind them.

posted by adipocere at 10:30 AM on April 15, 2010 [1 favorite]


Please note that the scanner on Cool Tools is the older version. The newer version is the S1300 and it's dual PC/Mac compatible (previously you had to choose a PC or Mac version during purchase).
posted by sharkfu at 11:15 AM on April 15, 2010


What you have failed to consider is the questionable durability of digital documents.
posted by neuron at 11:27 AM on April 15, 2010


Best answer: A reasonable combination of onsite and offsite backup systems can ensure redundancy and eliminate the 'questionable durability of digital documents'.
posted by shew at 12:00 PM on April 15, 2010


Best answer: A quick tip for scanning double-sided documents: to prevent the text on the other side of the document showing through, place a black sheet of paper, like construction paper, on top of the document as you scan it.
posted by telophase at 12:02 PM on April 15, 2010 [1 favorite]


Best answer: If you really want each file in three formats, you probably want to use the "Scan to Folder" action to produce a folder of TIFF images, then process these programmatically. I'd try writing a script to automatically generate PDFs using something like pdftk, do OCR with FineReader/Acrobat (both of which are packaged with ScanSnap scanners), and then extract the text as RTF.
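
As a rough sketch of the bookkeeping half of such a script: the Python below groups the page images in a scan folder into documents and works out where the three outputs for each document would go. The naming convention (`<document>_p<page>.tif`) and the archive path are assumptions made for illustration, not something the scanner software is known to produce, and the actual PDF, OCR and RTF steps are deliberately left out because they depend on which tools you end up with.

```python
import re
from collections import defaultdict
from pathlib import PureWindowsPath

# Assumed (hypothetical) naming convention: <document>_p<page>.tif,
# e.g. tax2009_p001.tif. Adjust the pattern to whatever your scanner writes.
PAGE_PATTERN = re.compile(r"^(?P<doc>.+)_p(?P<page>\d+)\.tiff?$", re.IGNORECASE)


def group_pages(filenames):
    """Group page files by document name, with pages in numeric order."""
    docs = defaultdict(list)
    for name in filenames:
        match = PAGE_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the convention
        docs[match.group("doc")].append((int(match.group("page")), name))
    return {doc: [name for _, name in sorted(pages)] for doc, pages in docs.items()}


def plan_outputs(doc, archive_root):
    """Return the target paths (PDF, RTF, multi-page TIFF) for one document."""
    root = PureWindowsPath(archive_root)
    return {ext: str(root / f"{doc}.{ext}") for ext in ("pdf", "rtf", "tif")}


if __name__ == "__main__":
    scans = ["tax2009_p002.tif", "tax2009_p001.tif", "lease_p001.tif", "notes.txt"]
    for doc, pages in sorted(group_pages(scans).items()):
        # The conversion and OCR calls would go here, one document at a time.
        print(doc, pages, plan_outputs(doc, r"D:\archive"))
```

Keeping the grouping separate from the conversion means the same page list can feed whichever PDF, OCR and TIFF tools you settle on.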


"What you have failed to consider is the questionable durability of digital documents."
Really? adipocere has suggested making redundant copies of each file in three separate formats, each of which is stable, widely-used, and well-documented.

But of course, you should make sure that you have lots of geographically distributed back-ups: Lots of Copies Keep Stuff Safe. And migrate to new storage media periodically. I'd rather have my important documents in multiple locations than in a single drawer.
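
One way to check that those copies really are identical is to compare checksums. The following is only a minimal sketch using Python's standard library; the function names and folder layout are made up for illustration, and it assumes each backup is a plain folder tree you can read from the same machine.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare(original, backup):
    """Return (missing, changed): paths absent from, or different in, the backup."""
    missing = sorted(set(original) - set(backup))
    changed = sorted(p for p in original if p in backup and original[p] != backup[p])
    return missing, changed


if __name__ == "__main__":
    # Small demonstration with throwaway folders standing in for real drives.
    with tempfile.TemporaryDirectory() as tmp:
        original, backup = Path(tmp, "original"), Path(tmp, "backup")
        for folder in (original, backup):
            folder.mkdir()
            (folder / "lease.pdf").write_bytes(b"scan data")
        (original / "tax.pdf").write_bytes(b"more data")  # not yet backed up
        print(compare(build_manifest(original), build_manifest(backup)))
```

Running the comparison after each backup, and again after each migration to new media, would flag files that never made it across or that changed on the way.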
posted by James Scott-Brown at 12:02 PM on April 15, 2010


While I agree with dmd that outsourcing this is a decent idea, this task is more about organization and electronic filing than it is about the physical task of scanning. Only the OP can decide what folder structure and naming convention works for him. Outsourcing this will eliminate the paper (the day you do it) but it will never be quite as useful as if it had been done by you, for you.
posted by shew at 12:04 PM on April 15, 2010


Best answer: "A quick tip for scanning double-sided documents: to prevent the text on the other side of the document showing through, place a black sheet of paper, like construction paper, on top of the document as you scan it."
This is good advice for a flat-bed scanner, but it won't help with an auto-feeding scanner (such as a ScanSnap), which automatically pulls one sheet of paper through at a time.
posted by James Scott-Brown at 12:05 PM on April 15, 2010


I'm currently in the middle of doing exactly this--scanning the contents of all my file boxes (journal articles and other miscellany from 6 years of university).

For me, PDF has been adequate so far (stored on my primary machine, backed up to an external hard drive and a thumb drive). I'm currently having lots of luck with the solution mentioned by filthy light thief--Adobe Acrobat's "Searchable Image" scanning option (image and OCR). Everything looks crisp and every little bit of text is recognised and searchable.

This might not be useful to you, but for a scanner I just went with a cheapish ($100CAD) all-in-one deal with an automatic document feed--the Canon Pixma MX350--and it's been perfect so far. I needed a new printer, though, so that might be overkill for you.

Still searching for a good file organization method. I'll be checking back here.
posted by 1UP at 3:03 PM on April 15, 2010

