Need an electronic filing cabinet.
September 1, 2012 12:41 PM
I'm getting a scanner, and I would like to keep my personal files electronically. Is Evernote Premium pretty much my only option, or is there some less expensive alternative I can use if I don't need cloud storage?
The big huge things for me are tagging and the ability to search OCRed text in PDFs--that's the only reason I really want to go through the effort of digitizing my current stuff. Evernote Premium will do the text search thing, but that's $45/year basically forever? And I'm really not keen on that. This isn't the sort of stuff I need to get to from anywhere; I'm fine with it just being on my desktop computer and periodically backed up. All of the files will be PDFs. Is there something that will work well for this without the ongoing service charge?
If Evernote really is the best thing for this, that's what I'll probably end up with, but I'm hoping to be able to trim the cost a bit since I really don't need to access my lease from the web, I just want to be able to pull it up on my home computer without having to physically search my boxes of papers every time.
Response by poster: Okay, duh, yes, I'm on Windows 7. Google Desktop stopped being supported last year, unfortunately. Windows 7's built-in search has not impressed me at all.
I am getting a sheet-feeder; I'm getting it used through a friend, so that's already set. I have Acrobat, so I don't need software that will perform the OCR, just software that will search through the text when I'm looking for things.
posted by gracedissolved at 12:58 PM on September 1, 2012
You can scan everything to PDF or multipage TIFF, and just store it as descriptively-named files. Both formats have fairly wide support, and both support metadata that your OS can use to search for particular documents.
Personally, I went with TIFF years ago, largely because of Microsoft's MODI/MDIW package included with Office: it supported OCR, drag-and-drop page rearrangement, inline annotation, print-to-TIFF, all sorts of good features that no (good) non-commercial PDF software had at the time. Unfortunately, although I "knew" better than to lock in to a semi-proprietary system, I did it anyway: MS discontinued MODI as of Office 2010, and even if you don't mind not upgrading (I don't), the 32-bit MDIW print drivers don't work on 64-bit versions of Windows. Given that, and the much better FOSS PDF support these days, if I had to make the same choice today, I'd probably go with PDF.
But no, you don't need any fancy document imaging package to store your files electronically. I would even suggest not using a document imaging package for similar reasons to my regrets about MODI - It may have the best features ever today, but will it still work 15 years from now? If you do need one, I'd say go with a package that only helps you manage the (stored in-file) metadata, so if you need to change at some point, you don't need to scramble to figure out how to export all your data.
One last comment on file sizes - 300dpi black-and-white CCITT Group 4 encoded files will look great for archival purposes and take up next to no room (under 50KB for clean typewritten pages). When you start scanning at 600dpi in greyscale or color, expect your documents to take up 2-4MB+ per page... Which may not sound all that big, but after roughly a decade of scanning everything I would otherwise have stuck in a filing cabinet, my "digital credenza" currently contains somewhere around 20k pages spread through 4400 files - which at 3MB per page would weigh in at 60GB. As it stands now, I can just barely still fit a full backup on a single DVD (but will probably need to upgrade to burning it to a Blu-ray disc by the end of the year).
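To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. The page count and the per-page sizes are the figures quoted above; the 4.7 GB single-layer DVD capacity and the decimal units (1 GB = 1,000,000 KB) are assumptions added for illustration.

```python
# Back-of-the-envelope archive sizes. Decimal units are assumed throughout.
PAGES = 20_000   # approximate page count quoted above
DVD_GB = 4.7     # assumed capacity of a single-layer DVD


def archive_gb(pages: int, kb_per_page: float) -> float:
    """Total archive size in GB for a given average page size in KB."""
    return pages * kb_per_page / 1_000_000


bitonal_gb = archive_gb(PAGES, 50)     # 300 dpi bitonal, 50 KB upper bound per clean page
grey_gb = archive_gb(PAGES, 3_000)     # 600 dpi greyscale/color at about 3 MB per page

print(f"bitonal:   {bitonal_gb:.1f} GB, fits on one DVD: {bitonal_gb <= DVD_GB}")
print(f"greyscale: {grey_gb:.1f} GB, DVDs needed: {grey_gb / DVD_GB:.1f}")
```

The 50 KB figure only applies to clean typewritten pages, so the bitonal total is an idealized number; a real archive with photos or messy originals will come out larger.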
posted by pla at 1:07 PM on September 1, 2012 [1 favorite]
Reading your followup comment, I would point out that Win7's built-in search will check PDF metadata, no 3rd-party software required.
posted by pla at 1:09 PM on September 1, 2012
If all your files are OCR'ed pdfs and you have Acrobat, you can use "search" (not "find") in Acrobat to search through all pdfs.
posted by prenominal at 1:28 PM on September 1, 2012
Response by poster: As I said, I know Windows has search, but I don't like the interface much. Among other things, Windows by default can also only tag certain types of files, and unless there's some really nonobvious way of enabling it, that doesn't include PDFs. Yes, I know I can change the file names. I want to be able to do real tagging. If I could get that, the internal search might be fine. Otherwise, it's not going to meet my needs.
I do appreciate the other advice, but I'm already very familiar with scanning documents and handling that side of things. I spent years doing it at work, but the software we used there isn't really suitable for home use, even if it weren't really expensive. I've already identified the features I want; I'm just trying to find something that has them.
posted by gracedissolved at 1:31 PM on September 1, 2012
Windows Search is able to dig out arbitrary text from PDFs for me.
I hemmed and hawed about doing this for ages, and ended up homebrewing a system based on PDFBeads and Tesseract OCR. This quickly and automatically produces a "good enough" OCR layer behind the scanned text, so I can type "bankname may 2012" and it will find a particular bank statement. I haven't had to futz with particularly complex filing structures (just a folder for each document type, with dates in the file names), and I let the OS do the searching.
PDFBeads produces tiny files (it uses JBIG2), often less than 10kB/page at 300 dpi. I really wish I'd done this sooner.
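For anyone who wants to try something similar, here is a minimal sketch of the batch OCR step, not the exact setup described above. It assumes a newer Tesseract release whose command line accepts the `pdf` output config (`tesseract IMAGE OUTPUTBASE pdf`), and it skips PDFBeads and JBIG2 entirely, so the resulting files will be larger. The folder names and the helper functions `ocr_command` and `ocr_folder` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(image: Path, out_dir: Path) -> list[str]:
    """Build the Tesseract call: tesseract IMAGE OUTPUTBASE pdf."""
    out_base = out_dir / image.stem  # Tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(out_base), "pdf"]


def ocr_folder(scan_dir: Path, out_dir: Path) -> list[Path]:
    """OCR every .tif in scan_dir into a searchable PDF in out_dir."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for image in sorted(scan_dir.glob("*.tif")):
        # check=True stops the batch on the first page Tesseract rejects
        subprocess.run(ocr_command(image, out_dir), check=True)
        results.append(out_dir / f"{image.stem}.pdf")
    return results


if __name__ == "__main__":
    # Hypothetical layout: raw scans in ./scans, searchable PDFs in ./searchable
    for pdf in ocr_folder(Path("scans"), Path("searchable")):
        print("wrote", pdf)
```

Keeping the scan's file name as the PDF name preserves the dated-file-name convention, so the OS search can match on both the name and the OCR text.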
posted by scruss at 1:49 PM on September 1, 2012 [1 favorite]
Canon's multi-purpose printers include software that scans straight to searchable PDFs. They seem to be perfect for your use. Check with Canon Support whether the individual printer in question includes the software, which is called MP Navigator EX.
posted by oxit at 2:02 PM on September 1, 2012
Best answer: Google Drive will do this for you very painlessly. Scan your documents to pdf. Save them to your local Google Drive folder. Google will sync them to the cloud and OCR them automatically. You have to be online to search inside multiple documents simultaneously, but you will always have them locally.
You can organize documents into folders, tag them, and add notes. Searching is a breeze.
posted by nedpwolf at 3:13 PM on September 1, 2012 [2 favorites]
Here’s a great way to decrease the file size of PDFs, while still being able to OCR them. It’s Adobe’s ClearScan technology, something I discovered a couple of years ago when Acrobat 9 Professional came out. It’s still in Acrobat X, too.
Yeah, I know, you have to spend the big bucks/quids/euros on the pro version of Acrobat, so it’s not a cheap solution for someone who doesn’t already have the pro version of Acrobat, or else Adobe’s Creative Suite (where it comes bundled).
As an example, ClearScan OCR can take 50 or 75 pages of a magazine that’s been scanned in color at 600 dpi—resulting in a file size of 3 gigabytes or so—and via ClearScan reduce it to 6 megabytes.
ClearScan works by analyzing the page, performing OCR, and then, everywhere there is text, removing the bitmap versions of the text and replacing them with outline fonts generated on the fly. Yes, it actually synthesizes outline fonts of everything that it recognizes as text on the page. Baskerville looks like Baskerville, Myriad looks like Myriad, Gotham looks like Gotham, etc. Now, if you were to enlarge the PDF file to 1000%, you’d notice that the outlines aren’t perfect, and they’re a little misshapen. But if you print out the file on a regular desktop printer, the results are quite acceptable. Images are compressed using (I think) the JBIG2 compression scheme, and you have the choice of keeping the images as they are or downsampling them to 600, 300, 150, or 72 dpi.
For best results in creating readable and good-looking versions of the outlined fonts, you have to start with 600 dpi scans—300 dpi scans will give you misshapen characters. My mode d’emploi is to scan a bunch of things at 600 dpi, and then use Acrobat’s “Recognize text in multiple files using OCR” command to batch-convert those files all at once (overnight works for me).
I used to use Acrobat’s “searchable image” OCR in previous versions, but that left me with magazine scans of 50 or 75 megabytes at 300 dpi. I like whittling my new 600 dpi ClearScanned magazines down to 6 or 10 MB much better. :-)
But, yeah, I know—if you’re trying to do this inexpensively, this whiz-bang technology doesn’t come cheap. It’s too bad Adobe prices the pro version of Acrobat as high as they do. :-(
posted by kentk at 8:11 PM on September 1, 2012
I'm surprised no one has mentioned PaperPort yet. It comes with a number of scanners and can index your files within its system. They're emphasizing "in the cloud" for the latest version of their software, but as far as I know, you can run it entirely on your own hardware and ignore the cloud features.
Similarly, the Fujitsu ScanSnap comes with document organizing software.
You might also find some useful ideas doing a search for free document management software.
posted by kristi at 5:25 PM on September 2, 2012
Response by poster: I already know which scanner I'm getting, and the software it comes with doesn't include anything like this. I think Drive is going to be the best option here, now that I've figured out the tagging part of it, along with importing things with Acrobat. (Which I already know how to use, but thank you, kentk, because I didn't know by just what order of magnitude ClearScan was better!)
The tagging thing, as I mentioned, is a big part of what I was looking for, and this seems like the best option right now that actually has that as a part of it. If anybody finds anything else that might work more locally, I'd be happy to hear it, still.
posted by gracedissolved at 6:59 PM on September 3, 2012
This thread is closed to new comments.
I highly recommend a scanner with a sheet feeder. My Fuji ScanSnap 510 has made scanning a breeze.
posted by brianogilvie at 12:54 PM on September 1, 2012