Need some file recommendations
July 20, 2014 6:30 PM

I'm trying to find the best file type for the larger documents I'm putting on my website. PDFs are just too big. Any suggestions?

There are some fairly large (document) files that I'd like to put up on my website. Due to slow internet speeds I seem to be limited to 10 MB or less. Some other websites I've looked at use DjVu files, which compress nicely but which I personally find a pain in the neck to use.

Are there other file formats that are better suited to putting documents online besides PDFs? I've considered breaking the PDFs themselves into smaller pieces; however, the freeware PDF editor I use for some reason makes the constituent pieces even larger, space-wise, than the original file. Anybody have experience with this?
posted by orrnyereg to Technology (9 answers total) 1 user marked this as a favorite
You could RAR the PDFs to break them down into small pieces, but none of them would be readable on their own. Is the issue per-file size or total?

You can also put them on Google Drive and have a viewer frame on your website.
posted by supercres at 6:45 PM on July 20, 2014

PDFs can vary enormously in size depending on how they are created. Perhaps you can create them another way, or use a tool to compress them after the fact.

Here's an online tool that might be helpful.
posted by nedpwolf at 6:52 PM on July 20, 2014 [3 favorites]

Memail me if you want a pdf-makin' pro (that'd be me) to ask you about your pdf production workflow and run an audit in a heavyweight pdf editor or two. It's almost always possible to reduce pdf file size.
posted by BrunoLatourFanclub at 6:58 PM on July 20, 2014

Is there a reason you can't make them available as plain text?
posted by colin_l at 9:05 PM on July 20, 2014 [1 favorite]

Seconding that it's probably not the PDF format's fault that they are too large; it's very likely your PDFs can be shrunk by adjusting the compression settings when creating them (link points to video for doing this easily in Acrobat).
posted by Aleyn at 9:08 PM on July 20, 2014

If your PDFs are really huge bytes-wise then I suspect they're basically images - like rendered scans - and that you could produce something much smaller with better production practices. It's been a while since I dealt with stuff like that, but for example, the default settings for LaTeX used to make rendered 600 dpi files, which were huge, and only looked nice when printed. If you fooled around with the font settings, you could get it to use a postscript font, and the files would be waaaaay smaller and would look much nicer in viewers also.

If you have some samples of your work I could probably be more specific/helpful.
posted by RustyBrooks at 8:17 AM on July 21, 2014

Thanks, all, for the suggestions! Here's some more info:

Most of these documents are high-quality color scans of historical records. I don't have access to the original records so I can't make my own lower-quality b&w scans. Because they're basically images I can't render them as plain text (also, the documents use Fraktur script which surely must be un-OCR-able).

Here's an example of what I'm talking about. In this case I was able to break up the document; but, obviously, I'd prefer to keep it in one piece. The web editor that I'm using doesn't limit the total size of files uploaded to a page--it's the individual uploading that's the problem, which I think is primarily due to my slow internet.

I'll try your compression suggestions and report back. Am I correct in assuming that TIFF files would only make the problem worse?
posted by orrnyereg at 3:33 PM on July 21, 2014

Ah, yeah, those samples would have been useful along with the question, since it really changes the available strategies.

If it were me, I think I'd probably take an approach along these lines:
1. break the files into individual pictures - TIFF would probably give the best quality, JPG the smallest size with reasonable settings. PNG is probably a very good option also. There are tools available to split PDFs into individual images.

2. I'd probably make a very simple browsing interface - show a page or 2 at a time, with buttons to go forward and back, a place to enter a page number, and probably a table of contents on the left with links to pages. The pages shown would default to much lower res pictures so people could load fast.

3. once you've broken it up into images you can use image editing tools (I'd probably make a shell or bash script and use something like ImageMagick, which is a command line tool to mess with images) to make multiple sizes of the images so that people could effectively "zoom in" by loading a larger res image.
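
As a rough illustration of step 3, here is a minimal Python sketch that only builds the ImageMagick command lines for several sizes of one page; it does not run them. The size tiers, the `web` output folder, the file names, and the quality value are all assumptions for illustration, not anything from this thread. `convert`, `-resize`, `-colorspace Gray`, and `-quality` are standard ImageMagick options (ImageMagick 7 prefers `magick` as the command name).

```python
from pathlib import Path

# Hypothetical size tiers: tier name -> ImageMagick -resize geometry.
SIZES = {"thumb": "25%", "medium": "50%", "full": "100%"}


def resize_commands(page: str, out_dir: str = "web",
                    grayscale: bool = False) -> list[list[str]]:
    """Return one ImageMagick `convert` command per size tier for a page image."""
    src = Path(page)
    commands = []
    for name, geometry in SIZES.items():
        # Output name is an assumed convention: <page stem>_<tier>.jpg
        dest = Path(out_dir) / f"{src.stem}_{name}.jpg"
        cmd = ["convert", src.as_posix(), "-resize", geometry]
        if grayscale:
            cmd += ["-colorspace", "Gray"]
        cmd += ["-quality", "80", dest.as_posix()]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    import shlex
    # Print the commands so they can be reviewed or pasted into a shell script.
    for cmd in resize_commands("page_001.tif"):
        print(shlex.join(cmd))
```

Each command list could be handed to `subprocess.run(cmd, check=True)` once ImageMagick is installed; printing them first makes it easy to check the settings on one page before converting a whole document.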

This may be more work than you're looking to do, but I think you could end up with a reasonable quality result.

If not this, then I'd probably recommend splitting the PDFs into images, downsizing them, and repackaging. Hopefully you could end up with much smaller PDFs. You might consider grayscale; the first sample I loaded (which, yeah, is huge) doesn't benefit much from the color. Uncompressed, a grayscale image is 1/3 the size of the same image in color (one channel instead of three), without any resizing. If you also made it 1/2 the width and height (a quarter of the pixels), then it would be around 1/12 the size of the original. That's a very large savings.
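
The arithmetic behind those fractions can be checked with a few lines of Python. This is only a sketch for uncompressed pixel data at 8 bits per channel; the page dimensions are made up for illustration, and real JPEG or PDF sizes will not scale this exactly because compression behaves differently on color and grayscale data.

```python
from fractions import Fraction


def raw_bytes(width: int, height: int, channels: int) -> int:
    """Uncompressed size in bytes at 8 bits per channel."""
    return width * height * channels


# Hypothetical page dimensions, chosen only for illustration.
width, height = 2000, 3000

color = raw_bytes(width, height, channels=3)                 # RGB
gray = raw_bytes(width, height, channels=1)                  # grayscale, same dimensions
gray_half = raw_bytes(width // 2, height // 2, channels=1)   # grayscale, half width and height

print(Fraction(gray, color))       # 1/3
print(Fraction(gray_half, color))  # 1/12
```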
posted by RustyBrooks at 6:50 PM on July 21, 2014

As the PDL source server seems to be down at the moment, I couldn't see the original image scans. The PDF pages in your document are pretty heavy on the compression artefacts, and they have the source library's watermark embedded in the images. Maybe you'd want to avoid those.

Using JPEG-2000 compression and img2pdf, I got your 18MB Geschäftsordnung down to 3MB with only minor additional compression noise.
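
For anyone wanting to try something similar, here is a minimal sketch of the repackaging step, not necessarily the exact procedure used above. It assumes the pages have already been compressed to JPEG-2000 files by some other tool, and that they are named with page numbers (the names and the helper functions `page_order` and `build_pdf` are hypothetical). `img2pdf.convert` is the library's documented entry point; it wraps the images in a PDF without re-encoding them.

```python
import re
from pathlib import Path


def page_order(path: str) -> list:
    """Sort key giving natural order, so page-2 sorts before page-10."""
    parts = re.split(r"(\d+)", Path(path).name)
    # With one capture group, odd positions hold the digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def build_pdf(image_paths: list[str], out_path: str) -> None:
    """Wrap already-compressed page images into a single PDF file."""
    import img2pdf  # third-party package: pip install img2pdf

    ordered = sorted(image_paths, key=page_order)
    with open(out_path, "wb") as f:
        f.write(img2pdf.convert(ordered))
```

Something like `build_pdf(["page-1.jp2", "page-2.jp2", "page-10.jp2"], "document.pdf")` would then produce one PDF with the pages in numeric order.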
posted by scruss at 5:56 AM on November 29, 2014
