Turning thousands of loose-leaf papers into dozens of PDFs
April 5, 2012 5:27 PM

I have a few thousand pages of documentation I'd like to digitize and organize, eventually making a series of PDFs out of them. Can you help me make this task less painful?

My company has tons and tons of game documentation in binders that doesn't exist anywhere else. I've brought up the issue of game preservation (see related topics on the blue) and somehow ended up with a gigantic pile on my desk that spans the last 15 years. I'd like to archive all of it, and then organize parts into smaller PDFs (i.e. a PDF of level designs for game X, another for concept art in game Y).

Here are some specifics:
* All the docs are looseleaf 8.5x11" or less
* Our office scanners don't have document feeders, so I need to purchase one myself - any recommendations? (It's from my own funds so I can't go too expensive)
* I have access to lots of common design and art software at work (such as Adobe Creative Suite)
* Lots of the documents are out of order, so they'll need to be organized and categorized after scanning but before being put into PDFs. Is there any software that can help me with this part?
* There are lots of duplicates that I probably won't catch until they're scanned. Again, is there any software that might help me automate this?
* Some pages are one-sided, and others are double sided, all mixed together
* Some pages have highly detailed, colorful drawings or diagrams, while others are just printed Word documents. Is there a good way to balance the scanning quality to suit both without file sizes becoming totally unreasonable?

Does anyone have any ideas on how to make this easier on me?
posted by subject_verb_remainder to Media & Arts (8 answers total) 10 users marked this as a favorite
 
It's funny because I remember seeing that Neat scanner organizer commercial, and I saw it at Office Depot. I believe this one is a keeper. I haven't tried it out, but I think they have a trial version. Go on Google and check the reviews, or look it up on YouTube. Seems like a great investment to me though. Hope this helps.
posted by ates at 5:49 PM on April 5, 2012


Fujitsu ScanSnap scanners are second-to-none for ADF scanning.
posted by fake at 5:51 PM on April 5, 2012


I had a previous-generation ScanSnap and it was great for this kind of thing: you feed it paper and it generates searchable PDFs. It handled double-sided documents and weird papers (notebook/legal pad/steno pad) fine but sometimes experienced a little image bleed-through on double-sided thin paper.

Lots of the documents are out of order, so they'll need to be organized and categorized after scanning but before being put into PDFs. Is there any software that can help me with this part?

I ultimately found that organizing the actual paper was faster and easier than messing with the PDFs, even with Acrobat Professional. Paper's pretty easy to shuffle around.

Some pages have highly detailed, colorful drawings or diagrams, while others are just printed Word documents. Is there a good way to balance the scanning quality to suit both without file sizes becoming totally unreasonable?

You can select several levels of quality and bit depth (b&w, grayscale, color), but if you have one color page in an otherwise b&w document, the ScanSnap workflow will force you to scan them separately and put them together in Acrobat.
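To get a feel for how much bit depth and resolution matter, here is a back-of-the-envelope sketch in Python. It only computes uncompressed sizes for a letter-size page; the function name and the 300 dpi figure are assumptions for illustration, and real scans will be considerably smaller once compressed.

```python
# Rough estimate of the uncompressed size of one scanned page side.
# The names and the 300 dpi setting are illustrative assumptions,
# not settings taken from any particular scanner.

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one page side, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

BIT_DEPTHS = {"b&w": 1, "grayscale": 8, "color": 24}

if __name__ == "__main__":
    for mode, bits in BIT_DEPTHS.items():
        size = raw_page_bytes(8.5, 11, 300, bits)
        print(f"{mode:>9}: {size / 1_000_000:.1f} MB per page side at 300 dpi")
```

Before compression, a 24-bit color page at 300 dpi comes to roughly 25 MB against about 1 MB for 1-bit b&w, which is why it pays to keep the plain text pages out of color mode.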
posted by pullayup at 5:54 PM on April 5, 2012


Right off the bat, before buying anything, run a multi-page test.

Other things to keep in mind:
- Test scan a sampling of the files that represent the variety of document types.
- Scan at different resolutions.
- Scan to raw image. You can assemble them easily but it is annoying to break apart PDFs.
- Play with file sizing.
- Print some of the samples and make sure that your scan resolution works for your printers.
- Create directory structures that you can easily track.
- Come up with a naming convention that is easy to enter, but descriptive. Be consistent.
- Consider overall archive size for all documents and look at your current capacity.
- Get decent backup software if you don't have it.
- Get decent scanning software that allows for OCR, front/back pages, mirrored pages etc etc.
- Try to make the scan the biggest part of the process per page as opposed to scanning, then editing in another app. Cropping, color balance, de-skew settings are very important and easily accomplished at the scanning point.
- Make decisions on all of the technical aspects and come up with a policy you will apply to all documents to be scanned.
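For the directory structure and naming convention points, one possible shape is sketched below in Python. The game/category folder layout, the zero-padded page number, and the `.tif` extension are all assumptions made up for the example; the point is only that the names are generated by one rule, so they stay consistent.

```python
import re
from pathlib import PurePosixPath

def slug(text):
    """Lowercase the text and replace runs of non-alphanumerics with one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def scan_path(game, category, page, ext="tif"):
    """Build a relative path for one scanned page (hypothetical layout)."""
    g, c = slug(game), slug(category)
    # Zero-padding keeps files sorted in page order in any file browser.
    return PurePosixPath(g) / c / f"{g}_{c}_{page:04d}.{ext}"

if __name__ == "__main__":
    print(scan_path("Game X", "Level Design", 42))
    # prints game-x/level-design/game-x_level-design_0042.tif
```

Repeating the game and category inside the file name is deliberate: a page stays identifiable even if it gets copied out of its folder.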

Taking the time to consider options in a methodical way will save you a whole boatload of grief later on down the line.

Oh and the NeatScanner is a fine little tool, but I really do not think it is appropriate for the job as described.
posted by lampshade at 6:04 PM on April 5, 2012 [2 favorites]


There are companies that do this.
posted by snowjoe at 6:18 PM on April 5, 2012


ScanSnap s1500 is the one, with one exception: it uses its own proprietary driver, not TWAIN, and so it can use only its own software. It is sold in specific PC or Mac versions. It was slow to support Windows Vista. (The Vista drivers work with 7, but I don't know about 8 and there could be another problem at that time.)

More universal is the Epson GT-S50. Just about as good and uses the normal TWAIN driver, so it can use any program that supports normal scanners, including the programs it comes with and normal Windows programs, and the one model supports both PC and Mac.

Both of the above have very similar, even faster siblings at about double the price.

The NeatDesk desktop machine is OK but it is slower by nearly half. It has better software for organizing receipts and addresses/contacts, which doesn't seem to be your goal. (I do not mean the little NeatReceipts one-page scanner here.)

With any of these, if you don't need to keep it, you can sell it on eBay when you are finished and get 75%+ of your money back.

DocumentSnap is a site with lots of articles about arranging various scanning tasks. It has a strong ScanSnap bias, with Amazon affiliate links for that scanner and accessories. But its articles and emailed newsletter are worth it. Nearly all of its information can roughly apply to the Epson scanner. (I have no connection to them, except as free subscriber.)
posted by caclwmr4 at 7:43 PM on April 5, 2012 [1 favorite]


Thanks! Looks like ScanSnap is the way to go (I'll look into its software to make sure it'll let me do what I want) and sell it once I'm done.

I'm leaning towards saving everything to raw images and assembling PDFs afterwards, and using Adobe Bridge to help review and organize stuff, since it lets me tag images with keywords.
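With raw images on disk, one way to flag likely duplicates is a simple perceptual hash. Two scans of the same page will not be byte-identical, so a plain file checksum would miss them. The sketch below uses an "average hash", which is a technique picked for illustration and not something from this thread. It assumes each page has already been loaded as rows of grayscale values from 0 to 255 (loading needs an image library, not shown), and the distance threshold of 5 is a guess that would need tuning on real scans.

```python
def average_hash(pixels, size=8):
    """Average hash of a grayscale image given as a list of rows of 0-255 ints.

    The image is reduced to a size x size grid of block means, and each block
    becomes one bit: 1 if it is brighter than the mean of all blocks.
    Assumes the image is at least `size` pixels in each direction.
    """
    height, width = len(pixels), len(pixels[0])
    blocks = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            values = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            blocks.append(sum(values) / len(values))
    mean = sum(blocks) / len(blocks)
    bits = 0
    for value in blocks:
        bits = (bits << 1) | (value > mean)
    return bits

def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")

def likely_duplicates(hashes, max_distance=5):
    """Return pairs of names whose hashes differ in at most max_distance bits."""
    names = sorted(hashes)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if hamming(hashes[a], hashes[b]) <= max_distance]
```

Anything this flags is only a candidate and still needs a look by eye before deleting, since two different pages of mostly white text can hash very close together.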

pullayup:
I had a previous-generation ScanSnap and it was great for this kind of thing: you feed it paper and it generates searchable PDFs.
How are the PDFs searchable? Does it recognize text in images or do you mean something else?

I ultimately found that organizing the actual paper was faster and easier than messing with the PDFs, even with Acrobat Professional. Paper's pretty easy to shuffle around.

That's a good point, so I can definitely do this... to an extent. I have, for example, four binders of documentation on one game from four different people and I'd like to return their binders with the content intact, but I can definitely organize within the binder prior to scanning.
posted by subject_verb_remainder at 1:03 PM on April 6, 2012


[I realize this is from yesterday, and might not even get read by the OP, but what the heck]

I'm in the middle of a project that is doing something similar. I graduated from university 2 decades ago, and my old college notebooks (mostly 3-ring binders), bluebooks, handouts, and other detritus needed to be scanned into electronic form.

I agree with the ScanSnap. The 1500 models are fast, and do double-sided. They auto-create PDF documents, and the software it comes with does OCR (ABBYY FineReader, a restricted version that only works with PDFs created with the ScanSnap).


So, the system that I've figured out over the past few months is something like this-


- I divide up the paper as best as I can, by class

- I scan the whole stack for that particular class, watching the feeder (sometimes it takes in 2 pages, but the scanner is smart enough to know when this happens. It is also very possible to overload the feed tray, so one must be there to add in the next stack, as needed)

- The software forms up 1 HUGE PDF with everything in, and saves it in a Dropbox folder, which instantly puts it on all of my computers

- At this point, the PDF is more or less OCRed and searchable (it is astounding how well it does even with CRAPPY handwriting (as in, mine)), but it isn’t finished. I manually feed it to ABBYY now.
-- This is important, as a fair number of PDFs created won't survive the ABBYY process, which means the ScanSnap software somehow messed up the PDF. If it fails, I delete the big PDF and start over. It always does work… eventually.

- The file is now moved into a directory hierarchy that works well with the way my mind works (e.g., 'Political Science/368 - Political Use of Military Force'). At this point, since the file is in Dropbox's cloud, after one more time looking at it in Acrobat, I recycle all of the paper, since silly things like my student ID (aka 'Social Security Number') are all over the place, etc. (Thank goodness for the secure recycling dumpsters at work!)

- Also moved into this directory is a textfile checklist of what needs to be done yet (for my stuff, it is usually removing the blank 3-ring binder pages (I only wrote on the 'front' side, but the scanner scans both), separating out the syllabus, the notes, the Kinko's pack into their own PDF files). As I do these tasks, I 'check them off' in the textfile. This lets me focus on a single 'job' for multiple classes, then move on to the next task.
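The blank-page cleanup in that last step is the sort of chore that can be partly automated. As a minimal sketch in Python, assuming each page is available as rows of grayscale values from 0 to 255 (getting those out of a PDF needs a separate library, not shown), a page can be flagged as probably blank when almost none of its pixels are dark. Both thresholds are made-up starting points; binder-hole shadows or bleed-through may push a truly blank page over the limit.

```python
def ink_fraction(pixels, dark_below=128):
    """Fraction of pixels darker than dark_below in a grayscale page."""
    total = sum(len(row) for row in pixels)
    dark = sum(1 for row in pixels for value in row if value < dark_below)
    return dark / total

def blank_pages(pages, max_ink=0.005):
    """Return 1-based page numbers whose ink fraction is at or below max_ink."""
    return [number for number, pixels in enumerate(pages, start=1)
            if ink_fraction(pixels) <= max_ink]
```

The output is a list of page numbers to review, not to delete blindly, so it fits the checklist approach: tick off "remove blanks" once the flagged pages have been checked.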

Naturally, this requires Adobe Acrobat (or another PDF editing package).

I have gone back and edited a number of the courses, and this system is working well. As of right now, I am keeping the large PDF file that is created (post-ABBYY only – the ABBYY versions are as much as 30-35% smaller in size), as well as the broken out pieces.


Since the files are in Dropbox, I can edit them at any time. It is odd, I will actually take a break from my highly technical workplace to edit apart some medieval Japanese history course I took as a sophomore – it is an amazing way to get my mind off of whatever technical glitch is getting my attention today and thinking about something else.


Is this system perfect? Not at all. There are give-and-takes. Bluebooks are a big PITA: they are booklets, and therefore wind up being scanned with pages very much out of order (after the rusted staples are removed…!). These require immediate editing: since I am going to recycle the paper products, I need the bluebook there, in my hands, reassembled as a booklet, to make sure that the pages wind up in the correct order.


In your case, if you can just make piles of all of your stuff by game/version/OS/whatever, and then scan the lot, it will help you out a great deal. Some have recommended reordering paper, and I agree. I would caution, however, that it is probably more important to get the stuff scanned into a PDF in a searchable form that is mostly in the correct order and together, even if you have to re-arrange pages afterwards, than to spend too much time preparing to scan by separating everything out. The ScanSnaps are quite good at identifying language and orientation, meaning it doesn’t even matter if the stuff is upside down; the PDF will look perfect.


Sorry for the disorganization and ramble.. Friday afternoon at work, I needed a distraction, and it is a holiday weekend, so I'm one of the few people here.


Best of luck!
posted by aarin at 3:09 PM on April 6, 2012 [2 favorites]

