The paperless life, on Linux
April 7, 2013 12:33 PM
Is there an integrated system that will let me digitize scraps of paper very quickly? The intended use case would be removing the paper clutter from a busy person's desk. Household uses Linux, Android phones, an iPad and a Mac.
I'm a fairly disorganized person in the physical realm. My desk is overflowing with diverse pieces of paper that I don't want to throw away because they might be interesting at some point in the future - credit card bills, event flyers, bills paid and unpaid, invoices, a one page quick guide to avalanche rescue etc.
Instead of cleaning it up, I prefer to engage in technological escapism, and I've convinced myself that it should be possible to just quickly suck the useful essence of these things into the digital realm, where I have terabytes of unused storage sitting around, and discard the physical husks afterwards, leaving me with a cleaner desk and more accessible information as well.
The workflow I'm imagining is that I'd take pictures of/scan these things quickly (ideally on either the computer or a mobile device), then get them automatically deskewed/OCRed/tagged with essential metadata (time/location/source device) and thrown into an editing queue for potential manual reprocessing or sorting into structures. Since it's 2013, this would ideally also include the mobile devices for both capture and access.
Is there any integrated system that will fit my requirements? I would prefer something open source that runs on my own Linux server, but I'm willing to look into service-based solutions as long as they allow me a reasonable amount of backup capability, so that they don't get to blackmail me too much for keeping my data accessible.
PS: I'm willing to pay unreasonable money for a physical device like the Rainbows End library digitizers - the satisfaction of throwing this stuff into a gaping maw that turns it into separate piles of shreddar and lovely bits must be priceless!
Get a ScanSnap scanner, plug it into the Mac, and sync it with Evernote.
posted by dfriedman at 1:28 PM on April 7, 2013
Got a ScanSnap S1300 for Christmas. It scans both sides at once and can OCR the text (including taking highlighted words as keywords in the resulting PDF). I use that with Hazel on the Mac to handle a lot of the general sorting: I have rules set up that say if the OCR'd text includes a certain account number, file it in a particular folder. ScanSnap + Hazel is a really powerful combination.
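For anyone without Hazel (on the Linux server, say), the same "if the text contains this account number, file it there" idea can be roughly sketched in Python. This is not how Hazel works internally; the rule markers, folder names, and the `pick_folder` / `file_document` helpers are all made up for illustration, and it assumes the OCR'd text has already been extracted to a string.

```python
from pathlib import Path
import shutil

# Hypothetical rules: a marker expected in the OCR'd text -> destination folder.
RULES = {
    "1234-5678": "bank/checking",
    "ACME Utilities": "bills/utilities",
}


def pick_folder(ocr_text, rules=RULES, default="unsorted"):
    """Return the folder of the first rule whose marker appears in the text."""
    lowered = ocr_text.lower()
    for marker, folder in rules.items():
        if marker.lower() in lowered:
            return folder
    return default


def file_document(pdf_path, ocr_text, archive_root):
    """Move pdf_path into the folder chosen from its OCR text."""
    target_dir = Path(archive_root) / pick_folder(ocr_text)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(pdf_path).name
    shutil.move(str(pdf_path), str(destination))
    return destination
```

Anything that matches no rule lands in an "unsorted" folder, which doubles as the manual review queue.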
posted by neilbert at 1:48 PM on April 7, 2013 [1 favorite]
The ScanSnap plus Evernote approach is what I use on a Mac OS X system.
Add a Doxie Go to that for mobile use, within the constraints of its battery.
posted by yclipse at 2:40 PM on April 7, 2013
I use DevonThink (Pro Office) on my Mac. It has a pretty good web server so you can self-host and access it from your Linux box. (My Linux box is a plex/timemachine/django-dev server; the year of Linux on the desktop came and went for me over a decade ago. I wish I knew of something half as useful as DevonThink for Linux.)
posted by Brian Puccio at 7:00 PM on April 7, 2013
ScanSnap as scanner. No question.
Software: don't bother tagging. Use filename of "YYYY-MM-DD company - title.pdf", use OCR and whatever search tool you like. Probably Tracker on Linux.
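A small sketch of that naming scheme, in case it helps to script it. The `archive_name` helper and the set of characters it strips are my own assumptions, not part of any tool mentioned here.

```python
import re
from datetime import date


def archive_name(doc_date, company, title):
    """Build 'YYYY-MM-DD company - title.pdf' from a date and two free-text parts."""

    def clean(part):
        # Replace characters that are awkward in filenames, then collapse whitespace.
        part = re.sub(r'[\\/:*?"<>|]', " ", part)
        return re.sub(r"\s+", " ", part).strip()

    return f"{doc_date.isoformat()} {clean(company)} - {clean(title)}.pdf"


print(archive_name(date(2013, 4, 7), "Visa", "March statement"))
# prints "2013-04-07 Visa - March statement.pdf"
```

Because the date comes first in ISO order, a plain alphabetical listing of the folder is also a chronological one.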
Scanning + OCR to PDF under Linux: I haven't found a good tool that does it all. I would guess gscan2pdf + tesseract-ocr would get you very close to what you want.
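As a rough sketch of the OCR half only: tesseract can be driven from a script. This assumes a tesseract build that ships the `pdf` output config (older releases only emit text or hOCR), that the scan is already an image file, and that the `tesseract_command` and `ocr_to_pdf` helper names are mine. It does no deskewing or scanning.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, language="eng"):
    """Build the tesseract invocation that writes a searchable PDF next to the image."""
    image = Path(image_path)
    output_base = image.with_suffix("")  # tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", language, "pdf"]


def ocr_to_pdf(image_path, language="eng"):
    """Run tesseract on one image and return the path of the PDF it should produce."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_command(image_path, language), check=True)
    return Path(image_path).with_suffix(".pdf")
```

Calling `ocr_to_pdf("bill.png")` would leave `bill.pdf` beside the original image if tesseract succeeds.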
See Duncan Brook's blog too.
posted by devnull at 1:42 AM on April 8, 2013
We have a ScanSnap in the lab and it is amazingly useful. Scans a pile of paper at a time, automatically does double-sided pages, and folds up pretty small when not in use. You can't miss.
posted by caution live frogs at 11:26 AM on April 8, 2013
Depending on how you intend to use the scanned files, it may be enough just to OCR the scans. I was going to set up something to filter and sort based on account numbers, business names, etc, then I realized that, most likely, I would never need to retrieve a scanned document, and that when I did, doing a little search refinement could get me what I needed pretty quickly. So, rather than paying the cost of better organization up front, I'd pay it at retrieval.
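To illustrate what "paying at retrieval" can look like, here is a minimal sketch that narrows results by requiring every search term to appear. It assumes the OCR text is kept as plain `.txt` files alongside the scans, which is my assumption rather than anything a particular scanner does; a desktop search tool would do the same job with an index.

```python
from pathlib import Path


def search_archive(root, *terms):
    """Return text files under root that contain every term (case-insensitive)."""
    wanted = [term.lower() for term in terms]
    hits = []
    for text_file in sorted(Path(root).rglob("*.txt")):
        content = text_file.read_text(encoding="utf-8", errors="ignore").lower()
        if all(term in content for term in wanted):
            hits.append(text_file)
    return hits
```

Adding another term to the call is the "little search refinement": each extra word can only shrink the list of hits.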
posted by Good Brain at 5:20 PM on April 8, 2013
Response by poster: Thanks everyone! CamScanner is indeed pretty close to the capture experience I was imagining, and if I decide to upgrade to dedicated hardware it seems there's also a fairly uncontroversial choice of document scanner.
posted by themel at 2:18 PM on April 11, 2013
This thread is closed to new comments.
posted by Leon at 1:06 PM on April 7, 2013