New scanner, same settings, PDFs 4x bigger. WTF?
March 23, 2011 1:38 PM

New network scanner/copier. Scanned files (PDF) are 4x larger than old model for same settings. What? Why?

We had a Canon c5185 printer/copier/scanner. We're a mostly paperless office; what paper we do handle gets run through the scanner and sent to a network share, or delivered to our desks via email from the copier as PDF files/attachments.

It was recently replaced with a Ricoh c6501. Nice unit, but there's an issue about the size of the PDF files that result from scanning paper documents. WAY too big.

45 pages in B&W at 300 dpi = 4.6MB (vs 1.3MB on the old unit)

Same pages scanned in color at 300 dpi = 16MB!!!

This is way too big - it's choking our email system, and our storage needs are going to balloon. The copier vendor isn't giving me any help in explaining why the same pages scanned at the same resolution are coming out 4x larger on this new scanner.

I have discovered that the old Canon used to deliver its PDFs already text-searchable, but the Ricoh does not. That means that the Canon had some onboard image processing that would run the PDF through OCR before delivering it. According to Ricoh, none of their units do OCR internally.

We already have the PDF compression settings turned up to the highest, but our scans are still 4x larger than they should be. And they're not even OCR'd - which I would think would make the file larger, no? I know we can shrink the file size and do OCR manually using Acrobat, but nobody wants to do extra steps on every piece of paper we handle, especially when the old scanner did it automatically.

Has anyone had this issue with Ricoh or other scanner / multifunction devices, where the scan files you get are ridiculously bloated?
posted by bartleby to Technology (9 answers total)
No experience with the copier myself. But from reading your question, I think you've answered it already.

If the old copier did OCR, then the PDF 'guts' were letters/symbols/glyphs, rather than graphical data.

i.e. the new copier is sending you a picture. The old copier sent you text [when it could] with embedded pictures.

(Sorta like typing a letter in a text editor and saving it versus using MS Paint to put the same text on a canvas. The text editor letter will be tiny, the paint bitmap huge, even though the content is the same.)
posted by k5.user at 1:46 PM on March 23, 2011

PDF is a sort of container for images (and text). It's possible that the image format and compression rates that the Ricoh uses are different from what the Canon did. Are they color images? Grayscale? Black and white?
posted by babbageboole at 1:47 PM on March 23, 2011

OCR would make it smaller.
posted by prenominal at 1:47 PM on March 23, 2011

One idea: can you set the Ricoh to scan a single page into a TIF file? Then use the same settings (resolution, color settings, etc) except scan the page to PDF. If the PDF is significantly larger than the TIF file... then the Ricoh is not saving PDFs correctly. Sadly, your only recourse is to return the Ricoh and get a better pri-cop-anner.
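
The TIF-versus-PDF comparison above could be checked with something like the following. This is only a minimal sketch: the function name `pdf_overhead` and the file names in the usage note are made up for illustration, and what counts as "significantly larger" is left to the reader.

```python
import os

def pdf_overhead(pdf_path, tif_path):
    """Return the PDF's size divided by the TIF's size.

    Both files are assumed to be scans of the same single page,
    made with the same resolution and color settings.
    """
    return os.path.getsize(pdf_path) / os.path.getsize(tif_path)
```

For example, `pdf_overhead("page.pdf", "page.tif")` (hypothetical file names) returning a value well above 1 would suggest the PDF wrapper, rather than the scan settings, is adding the bulk.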
posted by babbageboole at 1:53 PM on March 23, 2011

45 pages in B&W at 300 dpi = 4.6MB (vs 1.3MB on the old unit)

Same pages scanned in color at 300= 16MB !!!

The black-and-white scans were likely 256-shades, or 8 bits per pixel; color is probably 32-bit color -- or 4x the size of a black-and-white scan. There's your 4MB vs 16MB.

Ask your scanner service guy to turn down the color depth of your scans -- you might even be able to set it to 1-bit, which will look like a fax (nicer than a fax at 300dpi, but still faxlike), but will be about 20K-50K for a regular sheet of paper. Or, turn your color depth down to 1-bit and your resolution up to 600dpi, and you'll get some nice sharp text but still pretty small files.

If you're scanning in anything higher than 8-bit, you can turn down your DPI and still get a readable document, which will reduce the filesize. 200dpi is less than half the size of 300dpi -- 40,000 pixels per square inch versus 90,000 pixels per square inch, so at 8 bits per pixel you're using 50,000 fewer bytes per square inch before compression.
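
To make the resolution and bit-depth arithmetic concrete, here is a rough sketch of the uncompressed size of one page. It assumes a US Letter sheet (8.5 x 11 inches) and ignores compression entirely, so real PDF sizes will be far smaller; it only shows how size scales. The function name `raw_page_bytes` is made up for this example.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed raster size in bytes for one scanned page.

    The default page size (US Letter) is an assumption.
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Print a small table: resolution x bit depth -> megabytes per page.
for dpi in (200, 300, 600):
    for bpp in (1, 8, 24):
        mb = raw_page_bytes(dpi, bpp) / 1_000_000
        print(f"{dpi} dpi, {bpp:>2}-bit: {mb:7.2f} MB uncompressed")
```

Going from 300dpi to 200dpi cuts the pixel count to 4/9 of the original, and going from 8-bit to 1-bit cuts it by a further factor of 8, before any compression is applied.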

An OCR'ed file uses only 8 bits per character, versus per pixel; in an image, one character has a whole lot of pixels in it. OCR text is much, much smaller than an image, and doesn't add much to the size of an image-plus-searchable-text PDF.
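
A back-of-the-envelope comparison of the two, as a sketch only. The figure of 3,000 characters per page is an assumption, and the image side is an uncompressed 8-bit Letter page at 300dpi, so the real gap inside a compressed PDF will be smaller.

```python
CHARS_PER_PAGE = 3_000           # assumed: a fairly full page of text
BYTES_PER_CHAR = 1               # 8 bits per character

text_bytes = CHARS_PER_PAGE * BYTES_PER_CHAR
image_bytes = 2550 * 3300        # 8.5 x 11 in at 300dpi, 1 byte per pixel

ratio = image_bytes / text_bytes
print(f"text layer: {text_bytes:,} bytes")
print(f"page image: {image_bytes:,} bytes (uncompressed)")
print(f"the image is about {ratio:,.0f}x the size of the text")
```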
posted by AzraelBrown at 2:27 PM on March 23, 2011

Aha. I had been working from the incorrect assumption that adding an OCR text data layer to the PDF container would add to the size of the file. But you're all correct - if I run these PDFs through OCR in Acrobat, I get a smaller file. Thanks!

Now I just gotta figure out how to automate the OCR process at the sending end, because they're not going to be willing to OCR & reduce each file in Acrobat after they've received it.
Not to mention the fact that we're going to end up with GBs worth of scan attachments clogging up the mail system each day.
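
One way this kind of automation is sometimes done is with a small script on the machine hosting the scan share, so nobody has to open Acrobat. The sketch below is an assumption-heavy illustration, not something tested on this setup: it assumes the open-source OCRmyPDF command-line tool (`ocrmypdf`) is installed, which nobody in this thread has mentioned, and the folder layout and function names are made up.

```python
import subprocess
from pathlib import Path

def build_ocr_command(src, dst):
    """Build the OCR command for one file.

    Assumes the `ocrmypdf` CLI is installed; --skip-text leaves pages
    that already contain text alone.
    """
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]

def pending_scans(inbox, outbox):
    """List PDFs in inbox that have no same-named file in outbox yet."""
    done = {p.name for p in Path(outbox).glob("*.pdf")}
    return sorted(p for p in Path(inbox).glob("*.pdf") if p.name not in done)

def process_inbox(inbox, outbox):
    """Run OCR on every scan that has not been processed yet."""
    for src in pending_scans(inbox, outbox):
        dst = Path(outbox) / src.name
        subprocess.run(build_ocr_command(src, dst), check=True)
```

Calling `process_inbox` on a schedule, with the copier scanning into the inbox folder and users reading from the outbox folder, would keep the extra step off everyone's desk. Whether the output ends up smaller depends on the tool and its settings.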
posted by bartleby at 2:40 PM on March 23, 2011

I took an ordinary letter-sized printout of a Word document with default margins, and scanned it on our Ricoh c4500. I found one page to be about 110kB for black and white, 300dpi, 450kB for 8-bit greyscale, 480kB for color (should be noted that the original document was all b/w text, no color, no graphics).

What you found is in line with what our Ricoh produces. Off the top of my head, 110kB does not seem so bad for one page of scanned text, but I wonder what it is about your Canon that can do so much better. OCR may very slightly increase the size of your document (often the OCR is added as transparent text on top of the original scanned image), unless somehow the OCR actually replaces the image it finds with visible text. I would expect that the optimization using Adobe Acrobat just uses better algorithms to compress the images, plus some smart preprocessing to make the document more compressible (like deskewing and despeckling).
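
As a quick check of the "in line" claim, here is the per-page arithmetic using the 45-page totals from the question. This is just a sketch; it assumes 1MB = 1000kB and that every page is roughly the same size.

```python
PAGES = 45

# Totals in MB for the same 45 pages, as reported in the question.
scans_mb = {
    "Canon, B&W 300 dpi": 1.3,
    "Ricoh, B&W 300 dpi": 4.6,
    "Ricoh, color 300 dpi": 16.0,
}

per_page_kb = {name: mb * 1000 / PAGES for name, mb in scans_mb.items()}

for name, kb in per_page_kb.items():
    print(f"{name}: about {kb:.0f} kB per page")
```

That works out to roughly 102kB per page in B&W and 356kB in color on the 6501, against the 110kB and 480kB measured on the c4500, while the old Canon was managing about 29kB per page.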
posted by Maxwell_Smart at 2:58 PM on March 23, 2011

See if there is a PDF-MRC option.
posted by blue_wardrobe at 6:10 PM on March 23, 2011

Copier dealer here. The difference in file size is almost certainly coming from the Ricoh's lack of OCR. As others have said, without OCR, even a page of just text is basically treated as a giant photo, which makes for a large file.

There are some additional settings you might try (Text instead of Text/Photo mode; lowering the resolution; trying B&W instead of Grayscale, if that is an option), but really this boils down to your Ricoh salesman failing you as a client. He should have done his homework on the full capabilities of your Canon, interviewed key users or whoever is in charge about how the machine is used, and then made sure the model he's proposing is up to snuff and will deliver the same functions and usability that your prior equipment had.

Especially if having text-searchable scans was something you relied on as part of your job, I would really press the Ricoh dealer to remedy the situation to your satisfaction. The 6501 is a beast of a machine, and if they can't directly add OCR capability to the machine, they ought to at least be able to make it a simple part of your workflow so that your resulting scans are a more manageable size. You spent a pretty penny on a machine that large; if your satisfaction with the equipment's performance isn't important enough that they'll try their hardest to make it right, I would have serious worries about how well they'll take care of you down the road.
posted by xedrik at 6:47 PM on March 23, 2011
