Organizing metadata of pdf files
November 3, 2010 1:57 PM

How hard is it to attach metadata to pdf files for reference management software?

I have a few hundred pdf files that are well organized within bibdesk (each pdf file has metadata and tags associated with it). I want to be able to attach the metadata to the pdf, so that if the pdf is imported by another program (Mendeley, Papers, etc.) it will automatically have the associated data.

Is it possible to do this? What software would I need, or what programming would I need to do?
posted by a womble is an active kind of sloth to Computers & Internet (10 answers total) 2 users marked this as a favorite
 
Perhaps you can export/print the metadata to a PDF document, and merge it with the original. You could sync two libraries, one for bibdesk, and the other for access by other reference managers.
posted by Blazecock Pileon at 2:07 PM on November 3, 2010


Are you a programmer? If so you will find it quite easy.

PDF allows incremental updates which means that you can literally append new content to a PDF file without the need to modify the original content.

All that is required is that your incremental update provides the byte offset to the trailer dictionary of the original, and also that your new content conforms to the PDF spec's requirements for an update, which mostly are to add a trailer dictionary and a new cross reference table is any objects are modified in the update. Most PDF consumers will ignore content which is not required to render a page. Therefore your arbitrary content can piggyback on an original file without much difficulty.

It would require a little bit of reading of the PDF reference, particularly the File Structure section (3.4) and its subsection on incremental updates (3.4.5), and the ability to compute the position of the trailer dictionary in the original copy of the PDF file. And that's pretty much it.
posted by galaksit at 2:33 PM on November 3, 2010 [1 favorite]


a new cross reference table is any objects are modified in the update

IF any objects are modified ...
posted by galaksit at 2:34 PM on November 3, 2010


I meant to add this link to a downloadable copy of the PDF reference.
posted by galaksit at 2:36 PM on November 3, 2010


Response by poster: Perhaps you can export/print the metadata to a PDF document, and merge it with the original. You could sync two libraries, one for bibdesk, and the other for access by other reference managers.

I can export the metadata to a pdf, but I'm not sure how to add it in from scratch. When you say 'merge', do you mean literally merging pdf files or more sophisticated merging of metadata?

I did find this program: http://www.pdflabs.com/docs/pdftk-man-page/
It appears to have some functionality for editing metadata, but I don't know how the files are structured. I have some programming experience, but didn't follow what the reference table is that galaksit mentioned. How exactly do you write to a pdf file?
posted by a womble is an active kind of sloth at 2:55 PM on November 3, 2010


A PDF file is just a flat file, like a text file. To write to it, you'd simply open the file for writing with an appropriate library API from your preferred programming language, seek to the end of the file, and then write bytes to it.

To understand the rest of my answer ("trailer dictionary", "cross reference table", etc.) you'll need to download the PDF Reference and read the sections I mentioned.

I did say something wrong originally. Your incremental update needs a new trailer dictionary, and it needs a /Prev entry that points not to the previous trailer but to the previous cross reference table.

Roughly speaking you're going to open the PDF file for reading at the end, extract the last cross reference table offset (the number following the startxref token), and then use that value when you write the /Prev entry in your incremental update's trailer dictionary.
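To make those steps concrete, here is a minimal sketch in Python. It rests on several assumptions: the file uses a classic cross reference table with a trailer keyword (not a PDF 1.5 cross reference stream), and it is neither linearized nor encrypted. The function names and the /BibMetadata type name are made up for illustration and are not something other programs will recognize. The sketch only copies /Root and /Info into the new trailer; a real tool should carry over any other trailer entries such as /ID as well.

```python
import re


def pdf_literal(text):
    """Encode text as a PDF literal string, escaping backslashes and parentheses.

    Characters outside Latin-1 are replaced with '?' to keep the sketch short.
    """
    raw = text.encode("latin-1", errors="replace")
    raw = raw.replace(b"\\", b"\\\\").replace(b"(", b"\\(").replace(b")", b"\\)")
    return b"(" + raw + b")"


def build_incremental_update(pdf, fields):
    """Return the bytes to append to `pdf`: one new object, an xref section, a trailer.

    `fields` maps simple ASCII key names to string values.
    """
    # The number after the last startxref is the offset of the previous xref section.
    prev_xref = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])

    # Read /Size, /Root and (if present) /Info from the last trailer dictionary.
    trailer = pdf[pdf.rindex(b"trailer"):]
    size = int(re.search(rb"/Size\s+(\d+)", trailer).group(1))
    root = re.search(rb"/Root\s+(\d+\s+\d+\s+R)", trailer).group(1)
    info = re.search(rb"/Info\s+(\d+\s+\d+\s+R)", trailer)

    out = bytearray(b"\n")  # make sure the new object starts on its own line
    obj_offset = len(pdf) + len(out)
    entries = b" ".join(
        b"/" + key.encode("ascii") + b" " + pdf_literal(value)
        for key, value in fields.items()
    )
    # /Size is one more than the highest object number, so it is the next free number.
    # /BibMetadata is a hypothetical type name chosen for this example.
    out += b"%d 0 obj\n<< /Type /BibMetadata %s >>\nendobj\n" % (size, entries)

    # One xref subsection with a single 20-byte entry for the new object.
    xref_offset = len(pdf) + len(out)
    out += b"xref\n%d 1\n%010d 00000 n \n" % (size, obj_offset)

    out += b"trailer\n<< /Size %d /Root %s" % (size + 1, root)
    if info:
        out += b" /Info " + info.group(1)
    out += b" /Prev %d >>\n" % prev_xref
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def append_metadata(path, fields):
    """Read the PDF at `path`, then append the update without touching existing bytes."""
    with open(path, "rb") as f:
        pdf = f.read()
    with open(path, "ab") as f:
        f.write(build_incremental_update(pdf, fields))
```

Something like append_metadata("paper.pdf", {"Title": "Some title", "Keywords": "tag1, tag2"}) would then tack the new object onto the end of the file. Try it on copies first. Since nothing in the document refers to the new object, viewers should ignore it, but for the same reason another reference manager will only find it if it has been written to look for it.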

Also, I realise I didn't say anything about extracting your existing metadata. I assumed you had it in a plain form accessible to you programmatically and that you could just insert it in any form you wanted as a new object in your incremental update. However, it sounds like you need to extract it from BibDesk first. Sorry, but I don't know anything about that.
posted by galaksit at 3:17 PM on November 3, 2010


I can export the metadata to a pdf, but I'm not sure how to add it in from scratch

The process I'm thinking of is two-fold:

You have, for example, {tags/metadata} + {journal article} in bibdesk.

You export the {tags/metadata} component to a scratch PDF file, then merge that with the {journal article}, to build a library of merged, tagged journal articles that you access from Papers, etc.

It would be tricky to get from the merged file back to how Papers (etc.) stores its metadata, but if you can live with the merged file by itself, you could keep using bibdesk as your primary reference manager, and use other managers on an as-needed basis.

With just about any scripting language and the pdfsam tool, it should be easy to keep a library of updated merged PDFs available to external-reference-managers.
posted by Blazecock Pileon at 4:11 PM on November 3, 2010


The standard for digital document metadata management is XMP. It's maintained by Adobe, and PDF files support it well. Assuming your applications can read and write XMP metadata, it's a no-brainer.

A 5-second glance at some google results suggests you can use bibdesk to embed metadata as XMP inside your PDFs. Take a look at the docs for your other packages and see if they too support it.
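For a sense of what XMP looks like, here is a rough Python sketch that builds an XMP packet holding a title, authors, and tags as Dublin Core properties (dc:title, dc:creator, dc:subject). The function names are made up for illustration, and the choice of fields is an assumption about what a reference manager might look at. The sketch only produces the packet text. Getting it into a PDF still means storing it as a metadata stream that the document catalog points to through its /Metadata entry, which is not shown here.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # serialize with the conventional prefixes

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, name):
    """Build a namespace-qualified tag name in ElementTree's {uri}name form."""
    return "{%s}%s" % (NS[prefix], name)


def add_array(parent, prop, container, items, lang=None):
    """Add a dc:<prop> element holding an rdf:Alt, rdf:Seq or rdf:Bag of rdf:li items."""
    prop_el = ET.SubElement(parent, q("dc", prop))
    array_el = ET.SubElement(prop_el, q("rdf", container))
    for item in items:
        li = ET.SubElement(array_el, q("rdf", "li"))
        li.text = item
        if lang:
            li.set(XML_LANG, lang)


def build_xmp_packet(title, authors, tags):
    """Return an XMP packet (as a str) with Dublin Core title, creators and subjects."""
    root = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(root, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    add_array(desc, "title", "Alt", [title], lang="x-default")  # language alternative
    add_array(desc, "creator", "Seq", authors)                  # ordered list of authors
    add_array(desc, "subject", "Bag", tags)                     # unordered keywords
    body = ET.tostring(root, encoding="unicode")
    return ('<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
            + body
            + '\n<?xpacket end="w"?>')
```

Calling build_xmp_packet("Some title", ["Smith, J."], ["pdf", "metadata"]) gives you the XML to embed. Whether a particular reference manager actually reads these fields is something to check in its own docs.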
posted by russm at 3:58 AM on November 4, 2010


Response by poster: I've never heard of XMP data before - yes, I believe the other software reads this too. I can't find any sources stating that Bibdesk will attach this data to pdfs. Can you point me towards some?

This discussion makes me realize what a huge problem this is, neatly summed up in this paper: http://www.freelancepropaganda.com/archives/MP3vPDF.pdf which discusses the differences between MP3 metadata and PDF metadata.
posted by a womble is an active kind of sloth at 6:23 AM on November 4, 2010


I was basing my assumption on this fragment of text that showed up as the first hit on that google search - "As I said in my previous email, I found on hubmed's website some instructions to write XMP metadata on pdfs, specifically meant for bibdesk."... but the next sentence, "Nonetheless, this is far too complicated for the end-user." doesn't make it into google's extract...

If bibdesk won't embed the data itself, you may be able to export the metadata in some plain-text format and then attach it to the files with Acrobat or one of the tools listed here (though those seem more focussed around digital imaging)...
posted by russm at 2:08 PM on November 4, 2010

