rotating PDF pages in Photoshop changes filesize from ~230KB to ~1.4M
November 2, 2020 10:30 AM

My school has gone 100% online webinar. The entire textbook was scanned and divided into PDF chapters. So far, so good. However, the scanning was really really sloppy and I need to adjust the rotation of almost every page. Tedious and time-consuming, but not difficult, right? Wrong....

A whole chapter is about ~4.5M. An individual page, removed in Preview (OS X) and saved on its own is about ~230KB.

Preview only allows rotation at 90-degree intervals, and each of these pages must be manually rotated a few degrees this way or that. I have Photoshop, so I opened a single page (~230KB) in PS to see what I could do.

That single page immediately became ~1.4MB in Photoshop. I went in and reduced image size from 300dpi to 72dpi, as I would only be using the doc on my screen to share with my students. But that means when the doc is 100%, it's tiny. And still ~1.4MB.

How do I maintain file size on each of these pages while adjusting their rotation, and still keeping them "magnifiable" on screen without pixelization?

I'm obviously missing some basic info on this.
posted by tzikeh to Computers & Internet (13 answers total) 1 user marked this as a favorite
 
Do you have access to Acrobat? Not the reader, the full program.
posted by mr_roboto at 10:34 AM on November 2, 2020 [3 favorites]


I think what will help more than DPI (which generally affects only presentation, not file size, in a bitmap editor like Photoshop) is reducing the color palette. If the pages are black and white without illustrations or photos, I'd try:

1) Desaturate the image to get to a grayscale image (image -> adjustments -> desaturate)
2) Change to indexed color. I'd see if 16 colors works; if not, try 32 (image -> mode -> indexed color)
3) To take advantage of the indexed color, a PNG might be a better choice than a JPG when saving

Do the above after rotating!

The other thing which will affect the file size is not DPI, but actual pixel dimensions of the image. If the images are wider than 1200 pixels, I'd make them that width, letting the height be auto-adjusted (image -> image size). Again, do this BEFORE the indexed color operation.
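
To make the DPI-versus-pixel-dimensions point concrete, here is a rough Python sketch. It assumes a US Letter page (8.5 x 11 inches) in 8-bit RGB; the page size and both helper names are assumptions for illustration, not details from the question. It computes the uncompressed size of the bitmap, before any JPEG or PNG compression is applied.

```python
def pixels_for(inches: float, dpi: int) -> int:
    """Pixel count along one edge when the page is resampled to a given DPI."""
    return round(inches * dpi)


def raw_size_bytes(width_px: int, height_px: int, channels: int = 3) -> int:
    """Uncompressed size of an 8-bit-per-channel bitmap."""
    return width_px * height_px * channels


# Assumed page: US Letter, 8.5 x 11 inches.
w300, h300 = pixels_for(8.5, 300), pixels_for(11, 300)  # 2550 x 3300
w72, h72 = pixels_for(8.5, 72), pixels_for(11, 72)      # 612 x 792

print(raw_size_bytes(w300, h300))  # 25245000 bytes uncompressed
print(raw_size_bytes(w72, h72))    # 1454112 bytes uncompressed

# Changing only the DPI tag without resampling leaves the pixel grid,
# and therefore the amount of image data, unchanged.
```

Resampling shrinks the data because it removes pixels; how small the saved file ends up still depends on the compression used when saving.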
posted by maxwelton at 11:14 AM on November 2, 2020


The (free) tool to process scanned books quickly is Scan Tailor, but it may not run under Mac OS. It also only accepts bitmaps as inputs, so you'd also have to extract the scanned bitmaps from the PDF files. I don't know how to do that outside a command-line environment, where I'd use pdfimages. This way would keep the resolution intact, but allow straightening, page cleaning and margin cropping. With the right tool (depends on the page format and contents) you could end up with a smaller file per page using Scan Tailor.
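
For anyone who would rather script the extraction step, a minimal Python sketch follows. It assumes Poppler's pdfimages is installed and on the PATH; the file names and function names are hypothetical. The `-all` flag asks pdfimages to write each embedded image in its native format where it can, so JPEG scans come out as JPEG without being re-encoded.

```python
import subprocess


def pdfimages_cmd(pdf, out_prefix, first=None, last=None):
    """Build the argument list for Poppler's pdfimages.

    -all  write images in their embedded format where possible
    -f/-l restrict extraction to a page range (1-based, inclusive)
    """
    cmd = ["pdfimages", "-all"]
    if first is not None:
        cmd += ["-f", str(first)]
    if last is not None:
        cmd += ["-l", str(last)]
    return cmd + [pdf, out_prefix]


def extract_images(pdf, out_prefix, first=None, last=None):
    """Run pdfimages; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdfimages_cmd(pdf, out_prefix, first, last), check=True)


# Hypothetical usage: pull the image(s) on page 3 of chapter1.pdf into
# files whose names start with "page".
# extract_images("chapter1.pdf", "page", first=3, last=3)
```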

If the scans have been OCR'd via Paper Capture, how bad are the few degrees of squintiness? To correct them outside full ($$$) Acrobat Pro would lose any OCR information, and putting it back is tedious.
posted by scruss at 11:23 AM on November 2, 2020


Try tinypdf.com.
posted by Kiwi at 11:27 AM on November 2, 2020


Do you have access to Acrobat? Not the reader, the full program.

No.

1) Desaturate the image to get to a grayscale image

Can't - color is a significant part of the coursework.

The (free) tool to process scanned books quickly is Scan Tailor, but it may not run under Mac OS.

It doesn't.

Try tinypdf.com.

I already have the PDF files, but I have no access to the hard-copy textbook.
posted by tzikeh at 1:30 PM on November 2, 2020


If you can put up a sample page or chapter somewhere, that might help. PDF is a container format, and my guess is that the "save as" on the files you're modifying is somehow saving them in a much larger format (bitmaps vs. JPEG, e.g.), and that this is hidden by their being re-wrapped in PDF.
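
One crude way to check the container guess without uploading anything is to count the stream filters named in the file, as in the Python sketch below. The filter names are the standard PDF ones (DCTDecode is JPEG, FlateDecode is lossless zlib), but the function name and file names are hypothetical, and the byte scan is only a heuristic: FlateDecode is also used for non-image streams, and dictionaries stored inside compressed object streams will not be seen.

```python
import re
from collections import Counter

# Matches "/Filter /Name" or "/Filter [/Name1 /Name2]" in raw PDF bytes.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count how often each stream filter name follows a /Filter key."""
    counts = Counter()
    for match in FILTER_RE.finditer(pdf_bytes):
        for name in re.findall(rb"/(\w+)", match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts


# Hypothetical usage: compare the original page with the re-saved one.
# from pathlib import Path
# print(count_filters(Path("page_original.pdf").read_bytes()))
# print(count_filters(Path("page_photoshop.pdf").read_bytes()))
```

If the original shows DCTDecode and the re-saved page shows only FlateDecode, the image was probably re-encoded losslessly, which would be consistent with the jump in size.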
posted by mhoye at 2:10 PM on November 2, 2020 [4 favorites]


I've had much better luck with Foxit's PDF stuff than the basic Adobe stuff, and I don't want to buy the full Adobe suite. You can try using Foxit PhantomPDF on a free trial to rotate pages and then resave. I suspect it will do a better job of keeping things the same as they were before.
posted by JZig at 3:08 PM on November 2, 2020


It sounds like you're moving from OCR text to raster. You're probably screwed, as I don't know of any tool that allows granular rotation while maintaining OCR text within a PDF.

It sounds like this must be thousands of pages, or else it would simply be rescanned properly. Why is a file of only a few hundred megs problematic?
posted by turkeyphant at 5:17 PM on November 2, 2020


Who provides the scan? That can be a first step toward sanity. Could you ask the publisher for an accessible, digital instructor’s copy? Are you working with a campus librarian on posting course reserves? Is there a department of instructional technology that is working with faculty (and might be able to hook you up with better PDF options via shared supported software or tools within course management software)?

My context is campus disability services. We frequently work with scanned textbooks. They may be shared with students from the cloud, and are sometimes in chapters as separate files. Our process is remediation for screen readers, so we run the full version of Adobe. It’s not all smoothness and light, by a long shot, and I haaaate getting the sideways scan when the text is self-evident. Hopefully something above will shake something loose and improve this for you overall.
posted by childofTethys at 4:43 AM on November 3, 2020


Are you comfortable in the command line? There is a tool called "unpaper" that deskews automatically. You can also use unpaper through a different tool, "ocrmypdf," without actually running OCR, to both deskew and optimize file size. Both tools are open source.

On the paid side, I use ABBYY FineReader, which had excellent deskewing, although I'm not sure if deskewing can be done without OCR.

https://github.com/unpaper/unpaper
https://ocrmypdf.readthedocs.io/en/latest/
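
A minimal Python sketch of the deskew-without-OCR route, assuming ocrmypdf is installed and on the PATH; the file names and function names are hypothetical. Setting `--tesseract-timeout` to 0 gives the OCR engine no time to run, so no text layer is added, while `--deskew` and `--optimize` still apply. As far as I know, `--deskew` uses ocrmypdf's own deskewing and it is the separate `--clean` option that calls unpaper, so check the documentation linked above before relying on that detail.

```python
import subprocess


def ocrmypdf_deskew_cmd(src, dst, optimize=1):
    """Build an ocrmypdf command that deskews and optimizes without OCR."""
    return [
        "ocrmypdf",
        "--tesseract-timeout", "0",   # no time for Tesseract, so no OCR text
        "--deskew",                   # straighten each page
        "--optimize", str(optimize),  # 0 (off) to 3 (most aggressive)
        src,
        dst,
    ]


def deskew_pdf(src, dst, optimize=1):
    """Run ocrmypdf; raises CalledProcessError if the tool reports failure."""
    subprocess.run(ocrmypdf_deskew_cmd(src, dst, optimize), check=True)


# Hypothetical usage:
# deskew_pdf("chapter1.pdf", "chapter1_straight.pdf")
```

Deskewing means the page images are re-encoded, so the output size should be compared against the original rather than assumed to be smaller.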

Send me a MeMail if you want help with these or would like me to run your file through my setup. I use it mainly for public domain dictionaries, which come in all sorts of states and conditions.
posted by Mo Nickels at 8:01 AM on November 3, 2020


If you can put up a sample page or chapter somewhere, that might help.

I don't have any service that could host a PDF page--just images.

It sounds like you're moving from OCR text to raster. You're probably screwed, as I don't know of any tool that allows granular rotation while maintaining OCR text within a PDF.

They're just scanned pictures of textbook pages, combined by chapter and saved as PDFs. There is no OCR. They're flat scans.

Who provides the scan? That can be a first step toward sanity.

A guy who works the front desk in the office of my very small (as in we have no IT or HR departments) company who was tasked with scanning an entire textbook when classrooms went online suddenly due to lockdown.

It sounds like you're moving from OCR text to raster.

There is no OCR involved in this. They are flat scans of a textbook on a low-cost office scanner. The book was laid on a scanner, the page was scanned, the book was turned around, the other page was scanned, he turned the page, and so on.
posted by tzikeh at 11:39 AM on November 3, 2020


Try PDFsam.
posted by coberh at 7:50 PM on November 3, 2020


Like Mo Nickels, I'm stepping up to offer help. Like them, I usually work on public domain texts, but obviously, this one wouldn't be shared anywhere. Memail me if interested. Unless the scans are absolute dirt, I can usually get something acceptable from them. For instance, this one was badly photocopied/duplicated from a smudgy dot-matrix source with actual blobs of printer's ink on the page, and I'd still class it as mostly readable ...
posted by scruss at 9:57 AM on November 4, 2020

