PDF page to an image, why is this so hard?
February 16, 2010 12:48 PM

How would I go about recreating the document display functionality of Google Books / Safari Books Online? I was expecting to be able to take a PDF source and serve each page as a PNG to the browser, but this does not appear to be easy. It seems like a simple thing to do, yet I keep running into brick walls.

I was trying to improve my JavaScript skills, so I thought it would be neat to build an application that took my PDFs and rendered them as PNGs, with the JavaScript taking care of the navigation (for simplicity, I'm ignoring the zoom features). So I built a nice little JavaScript application that displays PNG files and lets you move through a book asynchronously. This, I thought, would be the hard part.

I assumed there had to be a library out there where I could call something like GetPage(int page) and get the page back as a PNG file. I figured out how to OCR with PDFBox, and I've found plenty of tools to convert a file to a PDF, but nothing that does something as simple as this.

Is there something out there that allows me to do this? I can't believe there are hundreds of ways to render something to a PDF, but none to go the other way around. This makes me think that I'm not googling the correct terminology. I also don't care if I have to go PDF->Something Else->PNG. I'm agnostic as far as the language is concerned; at this point I'd use a separate platform just to do this.
posted by geoff. to Computers & Internet (9 answers total) 4 users marked this as a favorite
 
The open-source image manipulation tool ImageMagick can burst a PDF into individual PNGs. However, it's an all-at-once process, not a single page fetcher.
posted by nomisxid at 1:13 PM on February 16, 2010


Open source PDF viewing & processing packages like ImageMagick are almost always based upon Ghostscript. So use Ghostscript directly. Btw, the Google Docs viewer already converts each page of a PDF into separate PNG files.
posted by jeffburdges at 1:29 PM on February 16, 2010


Ghostscript offers pretty solid PNG output. It can render just a specific page range of a PDF using -dFirstPage and -dLastPage, or dump all pages into separate files.
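
For what it's worth, here is a minimal Python sketch of the single-page case, assuming a `gs` executable is on the PATH. The function names, the 150 dpi default, and the png16m output device are illustrative choices, not something from this thread; only -dFirstPage and -dLastPage come from the text above.

```python
import subprocess


def gs_page_command(pdf_path, page, png_path, dpi=150):
    """Build a Ghostscript command line that renders one PDF page to a PNG.

    Page numbers are 1-based, as Ghostscript expects.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return [
        "gs",                          # assumption: Ghostscript is installed as `gs`
        "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=png16m",             # 24-bit colour PNG output
        f"-r{dpi}",                    # render resolution in dots per inch
        f"-dFirstPage={page}",
        f"-dLastPage={page}",          # same value as FirstPage: a single page
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


def render_page(pdf_path, page, png_path, dpi=150):
    """Run Ghostscript; raises CalledProcessError if it exits non-zero."""
    subprocess.run(gs_page_command(pdf_path, page, png_path, dpi), check=True)
```

Calling render_page("book.pdf", 3, "page-3.png") would then be the rough equivalent of the GetPage(int page) call the question asks about.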
posted by jeffburdges at 1:40 PM on February 16, 2010


Nthing Ghostscript. One suggestion, though: rather than converting to PNG on the fly on a page-by-page basis, it's probably best to render all of your PDFs into PNGs, keep them on the server, and serve them up as static files. This will certainly lighten the load on your server.
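
A rough Python sketch of that pre-rendering step, assuming `gs` is on the PATH. The helper names, the page-%04d.png naming scheme, and the 150 dpi default are made up for illustration. Ghostscript expands a %d-style pattern in -sOutputFile to the page number, so one run writes one PNG per page.

```python
import os
import subprocess

# Hypothetical naming scheme: page-0001.png, page-0002.png, ...
PAGE_PATTERN = "page-%04d.png"


def gs_all_pages_command(pdf_path, out_dir, dpi=150):
    """Build a Ghostscript command that writes every page as its own PNG."""
    return [
        "gs",                          # assumption: Ghostscript is installed as `gs`
        "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        # Ghostscript substitutes the page number for %04d in the file name.
        "-sOutputFile=" + os.path.join(out_dir, PAGE_PATTERN),
        pdf_path,
    ]


def render_all_pages(pdf_path, out_dir, dpi=150):
    """Render the whole PDF once so the PNGs can be served as static files."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(gs_all_pages_command(pdf_path, out_dir, dpi), check=True)


def page_path(out_dir, page):
    """Path of the static PNG for a given 1-based page number."""
    return os.path.join(out_dir, PAGE_PATTERN % page)
```

The web server then only has to map a page number to page_path(...) and hand back the file.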
posted by zsazsa at 1:54 PM on February 16, 2010


Response by poster: Ghostscript is exactly what I was looking for! Now that this is so easy, I'm curious: any idea how Google Books and Safari Books are able to zoom in and out so easily? I assume the images are dynamically generated. There could be an intermediate vector format they use to improve responsiveness, but I can't imagine they'd use static files.

In any case, Ghostscript works for my needs. Thanks again.
posted by geoff. at 3:32 PM on February 16, 2010


Safari Books uses Flash, so it's just sending intermediate vector data straight to the Flash plugin. So that means that they're not rasterizing any pages aside from the little book preview images.

As for Google Books, most of their content (actually, all that I've seen) is scanned documents. So that means that they either store the images in full resolution and resize them on the fly as they're served (saving disk space at the expense of CPU time), or just store intermediate sizes (saving CPU time at the expense of disk space). Who knows what they've optimized for; my gut tells me that they want to save CPU time since they want to get you the data as quickly as possible, but then again they have a lot of zoom levels.
posted by zsazsa at 3:54 PM on February 16, 2010


Response by poster: Ah, didn't notice Safari had Flash. They do a good job of hiding it; the controls are all HTML elements.
posted by geoff. at 4:34 PM on February 16, 2010


I'd personally process and cache a range of pages around wherever the reader's search lands them. I'd handle zoom levels similarly through caching: jump to preset zoom levels and render the neighboring levels in parallel or from a queue.
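
As a sketch of the "range of pages around the reader" idea in Python: the function name and the default radius of 2 are assumptions for illustration. It only picks which pages to render and cache next, nearest to the current page first, and leaves the actual rendering to whatever tool you use.

```python
def prefetch_window(current, total_pages, radius=2):
    """Pages worth rendering and caching around the reader's position.

    Returns 1-based page numbers clamped to the document, ordered so the
    current page comes first and the nearest neighbours follow.
    """
    if not 1 <= current <= total_pages:
        raise ValueError("current page is outside the document")
    first = max(1, current - radius)
    last = min(total_pages, current + radius)
    # Sort by distance from the current page; ties go to the lower page.
    return sorted(range(first, last + 1), key=lambda p: (abs(p - current), p))
```

For example, prefetch_window(5, 10) gives [5, 4, 6, 3, 7], and near the start of the book prefetch_window(1, 10) gives [1, 2, 3].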

If you're dropping $20k+ on a server farm, you'd likely process each PDF page separately on different servers, adapting to load by using ranges. I'd imagine this isn't a sound business plan, since Google already does it this way or better.
posted by jeffburdges at 9:38 PM on February 16, 2010


There's a command line wrapper / Ruby library around OpenOffice and GraphicsMagick called docsplit that does exactly this. The source is clean and readable, so even if you're not using Ruby or don't want to shell out for your splitting, you could probably rewrite it in your language of choice.

If it were me, I'd do all this processing ahead of time (with a background job queue like Resque) and store the resulting images on S3. Your viewer will be super quick if you do so, and since all the processing happens ahead of time, you won't need a huge server farm, just a cadre of happy workers.
posted by Jeff_Larson at 11:23 PM on February 16, 2010

