How prevalent are PDFs in web publishing?
September 8, 2004 6:53 AM

How prevalent are PDFs in Web publishing? Why are they used in preference to other file formats? (mi)

Well, it’s a vaguely (but not entirely) work-related question that I’ve been gnawing on for a couple of days. What I really want to do is get a sense of how important the capability to search and interpret PDFs really is to an enterprise search application. I’ve found a couple of raw statistics – eg 13 million PDFs searched by Google when they launched file format search in 2001 – but I’d like to better understand how widely they’re used on the invisible Web too: particularly corporate intranets and the like.

So, in a nutshell:
* How prevalent are PDFs on the Web / intranets?
* Are there types of sites that use them more often than others eg government sites, particular types of businesses?
* Why do people publish documents on the Web / intranets in PDF rather than other file formats?

Anecdotal input from anyone working with content management very welcome!
posted by bifter to Computers & Internet (28 answers total)
* I'd say PDFs are, after HTML, the most common format used for text and graphics. Certainly more common than any other document format.

* Scientific and academic institutions use PDFs very widely as well.

* The biggest advantage of PDF is that you can be pretty sure that a document will appear just as you want it to, without it getting mangled by someone's weird MS Word setup, etc. Also they aren't as easily edited as word processor files.
posted by Space Coyote at 7:04 AM on September 8, 2004

A lot of corporations use Word files internally for the same purpose, because they can ensure that all users have the software and the files are editable.
posted by smackfu at 7:04 AM on September 8, 2004

"to an enterprise search application."

Aha! You'll find PDF is much more heavily used within the company than outside.

"Are there types of sites that use them more often than others eg government sites, particular types of businesses?"

Most non-technical departments in a large company are addicted to Word, and you'll find that there are gazillions of Word docs on the intranet. However, HR, marketing, reporting and PR departments tend to run a larger chunk of their spew through Acrobat.

"Why do people publish documents on the Web / intranets in PDF rather than other file formats?"

Most business departments don't see the value of the web itself -- they only see a distribution medium for their forms, templates, and boilerplate to be printed and interoffice mailed. Word and PDF docs are published when the people doing the publishing: (a) don't get it, (b) want their output to look exactly the way it does on their screen, (c) don't care about the utility of the document to its consumers, (d) are bound to paper processes and using the intranet for a half-assed sort of automation; hey, they've removed the step where you have to call them up and ask for the form to be sent!
posted by majick at 7:12 AM on September 8, 2004

posted by grateful at 7:14 AM on September 8, 2004

Un-editability is key: I would never publish an editable version of a document. A paper, a form to fill out, etc. -- all of these are prime candidates for PDF. Especially when someone is going to download, keep, or print my document, it's important that it stays in its original format.

PDF ensures that the documents aren't accidentally edited/mangled by file conversion snafus, too. I would say that they're very common and that searching them efficiently (a la Apple's Preview app) is a must.

[On preview: certainly HR spew is lame, but PDF has a fair number of legitimate uses, most of which have to do with the integrity of the document as a whole -- ideally there would be HTML *and* PDF versions of everything.]
posted by josh at 7:15 AM on September 8, 2004

One of the big reasons we use them is that HTML is a very unreliable medium for a lot of other alphabets - we have content in Urdu, Bengali, Gurmukhi and Gujarati (as well as a few Latin script based languages), and we've had a bloody awful time trying to get translations in forms that display reliably in HTML.
posted by monkey closet at 7:15 AM on September 8, 2004

This doesn't really pertain to the question of searching, but PDF generation is fairly common in B2B web apps -- it allows the user to get a PDF generated on the fly with up to date info. Things like product info, catalogues, etc.
posted by o2b at 7:42 AM on September 8, 2004

The biggest reason that I see them used is when paper is involved (i.e. the content was originally created for distribution via brochure or a white paper). The creators are either so enamored with the exacting layouts offered by print vs. the web, or they don't want to spend the money to create two versions, or they are ignorant of how awful PDF is to read on a computer screen. Here is an example of a bus schedule from the local regional transit authority:

Oh and Jakob Nielsen says that PDFs are unfit for human consumption, but if you are creating a searching/indexing tool I'd say you have to support it. One would think that the ability to search Word docs and PDFs would be on almost everyone's corporate checklist.
posted by mmascolino at 7:43 AM on September 8, 2004

ten thousand. my god.
posted by AwkwardPause at 7:47 AM on September 8, 2004

How do you ask a question to be the 10,000th...oh, never mind.
posted by fletchmuy at 7:52 AM on September 8, 2004

most court documents available on the web are in .pdf (findlaw displays opinions in html versions) but the courts themselves are generally .pdf only. the library of congress legislative search service has both html and pdf available for bill text, pdf being courtesy of the GPO (so i would imagine it's a control issue with them).

when i worked in academic publishing, our contracts with authors did not allow online publication in any other format, entirely to keep the document integrity intact.
posted by crush-onastick at 7:56 AM on September 8, 2004

Too often they're used where they absolutely shouldn't be used. I'll get PDFs or Word documents from management that only contain 30 words of plain text and absolutely zero graphics or tricky formatting of any sort.

I'm a smart boy though, I learn quick. I delete PDFs or Word documents on sight if they come from the usual abusers. I used to convert them and repost them to the rest of the recipients (who mostly felt the same way I do) till management got uppity.

I use PDF documents when I need something that will look polished and is multiple pages in length. If browsers were smart enough to hit the next page link when page down is hit I'd probably just publish HTML though.
posted by substrate at 7:56 AM on September 8, 2004

PDF sucks, but another reason it gets used is that it's easier to dump a large word processing document into PDF than to break it up into individual pages and format it in HTML. In a lot of situations, those large documents are exactly what you want to be searching.
posted by fuzz at 8:04 AM on September 8, 2004

We use PDFs extensively as downloads when the client wants something to resemble its online (and print/outdoor) brethren. And no, this is not ignorance of the web, it's simple branding (and a recognition that most of what we provide is meant to be printed).

Admittedly, most of our stuff would fall under the "fun" banner, and users would have little need to edit anything we provide, hence a PDF is the only reliable way to ensure our fonts and imaging appear consistent.

Are they overused where a simple text document might work? Most certainly.
posted by jalexei at 8:06 AM on September 8, 2004

Ditto the idea that PDF is paper-related: and, of course, the reason your bus schedule is in PDF is because it's assumed that you're going to print it out and reconstitute it on the other end. The assumption is (I think, correctly) that I would like to have a compact, foldable, portable version of the document identical to the document as originally created. Thus forms, schedules, etc. are often in PDF. Wanton PDF abuse, as when people send out memos in PDF, is obviously not very useful.

To the question: many organizations use PDF to the degree that they are either bound to paper or bound to traditional publishing. Thus public services, academic institutions and so on use PDF very heavily.

Thus for some organizations PDFs *should be* very rare, while for others they *should be* very prevalent. You can probably decide based on your clients or target audience how important PDF needs to be in your product. As to the desirability of PDF, I think everyone (producers, readers, etc.) would be better off if they took a more reader-centric (rather than web-centric, a la Jakob Nielsen) view of PDF. It has appropriate uses, etc. etc. In my experience, when content producers understand what PDF is for they can use it more appropriately.
posted by josh at 8:12 AM on September 8, 2004

Hmm... we had an accessibility audit and the guy maintains that PDFs are inherently inaccessible (I disagree, myself, tho the PDFs that our untrained HR/ Finance etc people make certainly are inaccessible).

Anyone comment on that, while we're here?
posted by Pericles at 8:20 AM on September 8, 2004

Pericles - I might respond that I'd rather talk a novice through downloading the ubiquitous Acrobat Reader than explain the "open as: MS-DOS, MS WORD 5, 6, 7, RTF, TXT, Formatted blah blah blah" dialogue (and the resulting gibberish) they get with an errant Word file.
posted by jalexei at 8:36 AM on September 8, 2004

MIT's OpenCourseWare has made the decision to translate nearly all of its content into pdf, no matter the originating file format. Their reason for doing so is to create a lower barrier for use of OCW materials. By limiting the number of filetypes at OCW, they make it more likely a disadvantaged web user (oxymoron?) will have the tools necessary to access their material.
posted by Metametadata at 9:10 AM on September 8, 2004

We use PDFs for all of our client reports and internal documents because Word (and WordPerfect) suck terribly once you start embedding pictures/graphs/Excel tables in them. Our standard 25-30 page report with a dozen figures might be 50 MB in Word, but only 2-3 MB as a PDF. Guess which version doesn't get bounced as too big by most mail systems.

We also like PDFs because they solved the whole client-getting-weird-pagination problem when using a different printer than the one in our offices. Even exchanging documents between users in our office used to be a nightmare, just across various HP printers.

Regarding accessibility, the US govt has very strict rules for all of their stuff and pretty much all of the big documents on US govt sites are PDFs. As part of the Canadian Federal govt, we also have strict rules, and PDFs are just fine for us too. What that says, I don't know.
posted by bonehead at 9:24 AM on September 8, 2004

I maintain several large online PDF archives. My clients are concerned about people plagiarising their work product. Although the protections against editing and copying text are flawed, and bypassable by someone with a little tech savvy, they're enough to stop most users, and it's about the only way to protect your documents in this manner that's available.
posted by crunchland at 9:36 AM on September 8, 2004

In defense of PDF -- I used to be a hater as well, until I got OS X which has embedded PDF functionality in just about every application. It's easier to print to a PDF than it is to a printer, and there's a native application (Preview) that opens them, instead of Adobe's Reader.

Thanks to all of this, PDFs are quick to open and pretty easy to read (thanks in no small part to the lovely font smoothing).

It's my guess (and only a guess) that this is how Adobe would love PDFs to be treated across the board. Apple has obviously paid Adobe a licensing fee for all of this, and it makes the PDF experience quite seamless.
posted by o2b at 9:47 AM on September 8, 2004

wouldn't the accessibility issues be related to providing support for, say, people with sight problems? i don't think it's a technical accessibility thing, but rather a difficulty in presenting the information in alternative ways. i doubt word docs would be any better, but html does have markup to help support this kind of thing (eg the page could be "presented" via speech synthesis in a suitable browser).
posted by andrew cooke at 9:59 AM on September 8, 2004

Academic checking in. PDFs are certainly common around here, but PostScript files are just as (if not more) prevalent. The main reason for this is that (in physics, at least) almost everything is written these days in LaTeX, and the conversion chain for most LaTeX interpreters goes source -> DVI -> PostScript -> PDF. Everyone already has a PostScript reader on their machine, since they're already using TeX, so why not save yourself a step?

Of course, I'm in physics, and other natural sciences probably do things differently; my impression is that Word (shudder) is a lot more prevalent in chemistry and biology, though not in math.
posted by Johnny Assay at 10:01 AM on September 8, 2004

Adobe Reader 6 has built-in voice dictation, and lets you save as text, if you need accessibility. It also lets you substitute document colours if that's your bag, too.

"Apple has obviously paid Adobe a licensing fee for all of this, and it makes the PDF experience quite seamless."

Actually PDF is a free and open standard, which is partly why Apple chose it over PostScript for Quartz (as well as its having better colour management capabilities).
posted by Space Coyote at 10:10 AM on September 8, 2004

Ah, thanks Space Coyote.
posted by o2b at 11:10 AM on September 8, 2004

johnny assay - in biology you don't need the sort of advanced equation editing (usually) that you do for physics or math. we use word because we generate documents, not alphabet soup that only makes sense to another physicist. (probably most of what we write only makes sense to another biologist, but hey. at least i don't use powerpoint to make presentation posters like some of my colleagues - illustrator or nothing for me there.)

personally i hate people who use pdf when they could have used html or text (as i hate people who use pictures of text rather than text, or java/flash for navigation menus... learn to code, or hire someone who can, for god's sake). i find that 80% of the time pdf is misused - order forms, tax forms, scholarly reprints, etc. should be in a pdf, for downloading or printing, but when i run across businesses that use pdfs for their online catalogs - or people who post multi-page plain text pdfs that are clearly just generated from a word document - i want to scream. besides the fact that acrobat 6 is slower than a dead marmot (if you don't disable 90% of the plugins), it's idiotic to not use the web the way it ought to be used. it would be like sending out a mail-order catalog to people, except instead of putting it on paper it's put on a disk. it doesn't make sense. when you do real-world stuff, use print. when you do web stuff, use the simplest electronic version available. don't mix them unless you have a good reason to.

my basic interpretation is that pdf is used so often because people are too lazy to generate and maintain more than one version of anything. however, most programs these days can save as html natively. most fancy html editors can auto-cleanup docs that were created by using the office "save as html" feature. it's damn ugly code but it's not much more of a hassle than saving as a pdf after making a word doc. i never put any pdf content on my site unless i have a plain text / html version there too, unless it's an order form, tax form, etc.
posted by caution live frogs at 11:27 AM on September 8, 2004

PDF is *the* Portable Document Format. It can represent anything that is printable, and PDF documents can be viewed or printed with identical results on any modern OS. If these features match the goals you set for your document, then PDF is a very good choice, and its advantages may outweigh the convenience of, say, HTML.
posted by Galvatron at 11:32 AM on September 8, 2004

I work in the technology end of an ad agency. We use PDFs very extensively, for two main purposes:

- Internal forms that need to actually be printed out (requiring a signature).

- Ads sent electronically to clients for review. A Quark document is HUGE, and generally the person who needs to approve it at the client side doesn't have Quark. PDF provides a "close enough to print" version, which, when combined with Acrobat's pretty-good commenting tools, lets you Get Stuff Done.

That said, we don't index the content of our PDFs, since it's generally light on "searchable" content. Rather, we index them by title, product, job, etc.
posted by mkultra at 2:02 PM on September 8, 2004
