scanner with document feeder, b/w images encoded as GIF or PNG or equiv.
August 12, 2013 9:40 AM

I need to scan a bunch of documents that are mostly simple black/white or grayscale text, lines, etc. I'm probably going to just buy a sheet-fed scanner, and I'd like recommendations for that. I'm hoping for a solution that will automatically generate PDFs with the included scanned images encoded as GIFs or PNGs or similar, since they will mostly be black/white or grayscale, which compresses much more cleanly with these formats.

Hi there,

I need to scan a bunch of documents that are mostly simple black/white or grayscale text, lines, etc. I'm probably going to just buy a sheet-fed scanner, and I'd like recommendations.

(I think I'd prefer a one-function scanner, not an all-in-one; I like my current printer just fine, and don't want another huge box if I can avoid it.)

If possible: Since the documents aren't as visually complex as photographs, they will compress much better, and be clearer at smaller file sizes, if I can use a limited-palette encoding format like GIF (or PNG, I think, though I don't know much about PNG). I'd like these automatically embedded in Acrobat files, since some will be multiple pages. I don't want to have to fiddle with each one individually, but could set the color palette for each batch, for example.

I know this last item is a software matter rather than a hardware matter, but I'd rather not spend hours and hours setting this up -- and I'm not sure how practical it is anyway.

I'm also willing to consider scanning services in my area (Chapel Hill/Durham NC), but I think it's probably worth it to me to just set this up myself and scan everything over a couple of months.

Can anyone recommend a specific scanner and/or software to make this happen?

Thanks!
posted by amtho to Computers & Internet (16 answers total) 1 user marked this as a favorite
 
I love my Fujitsu ScanSnap. I scanned several large filing cabinets worth of documents with it and it was a dream to use.

I haven't played with all the settings so I can't confirm whether it can do exactly what you described in your post, but it's pretty configurable so if any scanner can do it then I would expect the ScanSnap can.
posted by Jacqueline at 10:05 AM on August 12, 2013


Response by poster: I should have mentioned: I'm running Windows (XP, and 7).

(Thanks for the suggestion so far).
posted by amtho at 10:10 AM on August 12, 2013


ScanSnaps are great for scanning to searchable PDF, but according to the settings for ScanSnap Manager your choices are PDF or JPEG. No GIF or PNG. Sorry.
posted by devnull at 10:10 AM on August 12, 2013


Response by poster: I guess what I'm hoping for is scanning to PDF, but customizing settings for the PDF, so that the PDF will contain GIF/PNG rather than containing only JPEG images.
posted by amtho at 10:13 AM on August 12, 2013


Let's separate your task into two parts: scanning and document creation, since the two are separate. If you want decent scanning with a document feeder, I've been very happy with Fujitsu scanners in the same form factor as the fi 6110. They're nice because the scanning is snappy and the document feeding works pretty well and they do duplex. They offer TWAIN and ISIS drivers, so just about everything will be able to consume the output of the scanner.

As for PDF creation with specific compression settings, that's another question entirely.

Just because you save a scan as a PNG doesn't mean it will end up with FLATE (the typical PNG compression) in the output PDF. That has everything to do with the conversion software.
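
To get a rough feel for what Flate does with a mostly-white black/white page, here is a minimal Python sketch. Everything in it is an assumption for illustration: the page is synthetic (random black runs standing in for text), the dimensions are made up, and Python's zlib module stands in for whatever a PDF writer would actually do. It only shows that the compression is lossless and that sparse bilevel data shrinks a lot; it says nothing about any particular scanner's output.

```python
import random
import zlib

WIDTH, HEIGHT = 1700, 2200      # assumed: roughly a letter page at 200 dpi
ROW_BYTES = (WIDTH + 7) // 8    # 1 bit per pixel, 8 pixels packed per byte


def synthetic_page(seed: int = 0) -> bytes:
    """Build a fake 1-bit page: white background with short black runs as 'text'."""
    rng = random.Random(seed)
    rows = []
    for y in range(HEIGHT):
        row = bytearray(b"\xff" * ROW_BYTES)        # start with an all-white row
        if (y // 20) % 2 == 0:                      # alternate 20-row bands of "text" and blank space
            for _ in range(30):
                start = rng.randrange(ROW_BYTES - 2)
                row[start:start + 2] = b"\x00\x00"  # a 16-pixel black run
        rows.append(bytes(row))
    return b"".join(rows)


def flate_sizes(raw: bytes) -> tuple:
    """Return (raw size, deflate-compressed size) and check the round trip is lossless."""
    compressed = zlib.compress(raw, 9)
    assert zlib.decompress(compressed) == raw       # lossless: we get the same bytes back
    return len(raw), len(compressed)


if __name__ == "__main__":
    raw_size, flate_size = flate_sizes(synthetic_page())
    print(f"raw bitmap: {raw_size} bytes, Flate-compressed: {flate_size} bytes")
```

Real scans have noise and skew, so actual ratios will differ from whatever this prints.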

As an aside, PNG is not a terrific document format, being single-image per file.

That scanner comes bundled with a piece of software to generate PDF (among other formats). There is a slider to set the compression level, but again, there is no indication as to what specific format will be used, since that's smudged over in the UI. It might be configurable by manually editing a profile.

The scanner also comes bundled with a version of Adobe Acrobat Standard, which includes a "create from scanner" feature. In my version of Acrobat, you can absolutely set it to use LZW (FLATE) and CCITT compression.

I'd love to tell you that the company I work for has a product suite that lets you do exactly what you want in about 3 dozen lines of code, but (1) you have to be able to code in .NET, (2) it's a Windows-only solution, and (3) it will cost way more than your scanner (the feature set you need isn't in our free edition).
posted by plinth at 10:14 AM on August 12, 2013


I can't find out how to do that, so I doubt you can. But have a look, maybe you can find it.
posted by devnull at 10:16 AM on August 12, 2013


GIF image encoding isn't possible in PDFs, but PNG is. As for seeing if a particular scanner would use PNG instead of TIFF, JBIG, or another encoding, I imagine that would be hard. However, the Adobe Acrobat software definitely lets you specify that you want images to be stored with PNG encoding.

I'll defer to plinth, because he has more experience than I do, as evidenced by the Xerox JBIG bug thread on MetaFilter last week!
posted by zsazsa at 10:18 AM on August 12, 2013


Best answer: I looked at the ScanSnap manual and FWIW, it doesn't look like it bundles software that does anything other than create a PDF with no other options.

Maybe if I have time this afternoon, I'll write that application. I think I can do that without a conflict of interest.
posted by plinth at 10:21 AM on August 12, 2013 [1 favorite]


Fujitsu ScanSnap works great for me for this purpose. The non-OCR'd PDFs it generates are small enough for my purpose. It has a compression slider but it doesn't let you choose between different internal image formats. I'm using it on a Mac, so I can't say how the Windows drivers are. I can say that I love this scanner.
posted by snuffleupagus at 10:27 AM on August 12, 2013


I'm happy to examine one of the PDFs the ScanSnap generates, to figure out how it's storing page images w/a given profile setup. If that would be helpful, and if I can figure out how to do it.
posted by snuffleupagus at 10:32 AM on August 12, 2013


I know you said that you weren't after all-in-ones, but my (overly large) Epson WorkForce scans directly from ADF to PDFs in its SD card slot. B&W PDFs are encoded as G4 TIFFs. The setup is minimal. Colour PDFs, unfortunately, are JPEG from this device, unless you scan directly through an application.

I bought it to be a cheaper alternative to a ScanSnap. I rolled my own application to do OCR and encode (semi-carefully) as JBIG2 PDF afterwards.

pdfimages will extract bitmaps embedded in most PDFs (you'll probably want a front-end that encodes in more useful image formats). If you really just want to render a PDF to the bitmap format of your choice, Ghostscript is still your friend.
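
A small Python sketch of how those two tools might be driven, in case it helps. The assumptions here: a pdfimages build that supports the -list option (recent Poppler versions do), Ghostscript's pnggray device, and the usual executable names (gs, or gswin64c / gswin32c on Windows). The helper names are made up for this example, and the output file pattern and 300 dpi are arbitrary choices.

```python
import shutil
import subprocess
import sys


def list_images_cmd(pdf_path: str) -> list:
    """Command asking pdfimages to list embedded images (the listing includes an encoding column)."""
    return ["pdfimages", "-list", pdf_path]


def render_cmd(pdf_path: str, out_pattern: str = "page-%03d.png",
               dpi: int = 300, gs_exe: str = "gs") -> list:
    """Ghostscript command that renders every page to a grayscale PNG."""
    return [
        gs_exe, "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pnggray", f"-r{dpi}",
        f"-sOutputFile={out_pattern}", pdf_path,
    ]


def find_ghostscript():
    """Return the first Ghostscript executable found on PATH, or None."""
    for name in ("gs", "gswin64c", "gswin32c"):
        if shutil.which(name):
            return name
    return None


if __name__ == "__main__" and len(sys.argv) == 2:
    pdf = sys.argv[1]
    if shutil.which("pdfimages"):
        subprocess.run(list_images_cmd(pdf), check=True)
    gs = find_ghostscript()
    if gs:
        subprocess.run(render_cmd(pdf, gs_exe=gs), check=True)
```

Saved under a name of your choosing and run with a PDF path as its only argument, it lists the embedded images and then writes one PNG per page into the current directory.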
posted by scruss at 10:32 AM on August 12, 2013


Best answer: I'm happy to examine one of the PDFs the ScanSnap generates...if I can figure out how to do it

This is trivial (ish).

Create a single page document with the settings you want.
Open the document in a decent text editor (I use Notepad++ for this kind of task).
Search for "/DCTDecode". If you find that, the compression is JPEG - too bad.
Search for "/CCITTFaxDecode" (black and white only). If you find that - hooray!
Search for "/JBIG2Decode" (black and white only). If you find that - too bad.
Search for "/JPXDecode". If you find that, that's JPEG2000, also too bad.
Search for "/RunLengthDecode". If you find that - yay? It's an awful compression, but it's not lossy.
Lastly, search for "/LZWDecode" and "/FlateDecode". If you find those - keep searching. If all you find are either of those - yay! It's using PNG-type compression. This should be last because the presence of /FlateDecode doesn't preclude another compression. /FlateDecode gets used commonly on other types of embedded data since it's lossless. You will see it used, for example, on XMP data or page marking content.

I say trivial-ish because there is a non-zero chance that the document is encoded with a flavor of PDF that might hide some of those things in a misfeature called an object stream.
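
If you'd rather not eyeball it in a text editor, the same search can be sketched in a few lines of Python. This is only a rough sketch under the same caveat: it does a raw byte search for the filter names, so anything tucked away inside an object stream will be missed (it just warns if it sees one). The function names are made up for this example.

```python
import re
import sys

# PDF stream filter names and the compression each one indicates.
FILTERS = {
    b"DCTDecode": "JPEG (lossy)",
    b"CCITTFaxDecode": "CCITT fax (lossless, black and white only)",
    b"JBIG2Decode": "JBIG2 (black and white only)",
    b"JPXDecode": "JPEG2000",
    b"RunLengthDecode": "run-length (lossless)",
    b"LZWDecode": "LZW (lossless)",
    b"FlateDecode": "Flate (lossless, PNG-type; also used on non-image data)",
}


def count_filters(data: bytes) -> dict:
    """Count how often each filter name appears in the raw bytes of a PDF."""
    counts = {}
    for name in FILTERS:
        # Match "/Name" only when it is not the prefix of a longer name.
        pattern = re.compile(rb"/" + re.escape(name) + rb"(?![A-Za-z0-9])")
        n = len(pattern.findall(data))
        if n:
            counts[name.decode("ascii")] = n
    return counts


def has_object_streams(data: bytes) -> bool:
    """True if the file declares object streams, which can hide filter names."""
    return re.search(rb"/Type\s*/ObjStm", data) is not None


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as f:
        pdf_bytes = f.read()
    for filter_name, n in count_filters(pdf_bytes).items():
        print(f"/{filter_name}: {n} occurrence(s), {FILTERS[filter_name.encode('ascii')]}")
    if has_object_streams(pdf_bytes):
        print("Warning: object streams present; some filters may not be visible to a raw search.")
```

As with the manual search, a hit on /FlateDecode alone doesn't tell you it's being used on the page images, only that nothing else showed up.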
posted by plinth at 11:02 AM on August 12, 2013 [1 favorite]


Best answer: Running pdf-parser against a ScanSnap-produced B&W PDF at medium compression shows it using FLATE. I can pastebin the whole output if it would be informative.
posted by snuffleupagus at 11:03 AM on August 12, 2013


Thanks plinth, that's good stuff to know. I'll check for the decode at the end of the PDF.
posted by snuffleupagus at 11:04 AM on August 12, 2013


Best answer: ...and it's FlateDecode throughout the formatting information I can see in TextMate. First, last, and none of the others present.
And a scan of a two-sided Apple store receipt at these settings comes out to 508 KB.
posted by snuffleupagus at 11:09 AM on August 12, 2013


I have a Doxie One that's good for what you describe.

Plusses:
* Doesn't have to be connected to a PC to scan. It scans to an SD card which you take out and put into the PC.
* Small, portable and light.
* Can put rechargeable batteries in it to use it without being plugged in at all.
* Software is easy to use and produces nice OCR'd PDFs.
* Much less expensive than Canon and Fujitsu.
* Excellent user reviews.

Minuses:
* Fairly slow, especially if you have a big initial batch.
* Getting paper to go straight through is a little fiddly if the paper isn't perfectly flat to begin with.
* I worry about the long-term quality of the scanning mechanism. Worried that it will break.

I don't know how it addresses the GIF/PNG requirement, but I wouldn't see that as necessary given the current prices of storage and bandwidth.

I would recommend the Doxie, especially given how much less it costs than comparable Canon and Fujitsu devices.
posted by cnc at 12:00 PM on August 12, 2013

