Is there software that can organize and search PDFs based on their subjects?
May 18, 2009 9:56 AM

How do you make journal article PDFs searchable by keywords and controlled-vocabulary subject headings, as well as by authors and titles?

I'm working for a team of researchers who would like to share their personal files of journal articles and reports with each other. The files are primarily stored as PDFs, but there are also some Word documents. The team uses Windows machines, and the core group shares a network drive. They would like a more organized system that allows more precise searching and useful sorting. I've looked at various bibliographic management programs, but I'm hoping for something that can grab metadata (like pre-defined subject headings) from a massive quantity of PDFs (rather than from imported citations). I'm not sure whether those programs do that, or whether the PDFs even have that information embedded in them. I've also considered document management systems, but wonder if that might be overkill, and we have limited IT support. That said, if I could find one that generates RSS feeds for the different researchers, based on the subjects of new articles added, that would be amazing, and it's something they would very much like. Automatically generated hierarchical folders would be nice, unless I can convince them that search tools make that unnecessary. Will Owl, Alfresco, KnowledgeTree, OpenKM, or SharePoint work? How difficult is it to implement these systems? Is it better to just stick to bibliographic software like Zotero, Aigaion, or Connotea? Which one of these would be ideal?

Posting anonymously since I'd rather be discreet for my employer's sake.
posted by anonymous to Computers & Internet (4 answers total) 2 users marked this as a favorite
Although this is probably overkill for you, my company sells a product called dotImage which includes tools that can extract and rewrite PDF document metadata. This makes it straightforward to implement a tool that scans for metadata to seed a search engine, as well as a tool that finds all PDF documents that are missing metadata and sets up a workflow to add it. I say overkill because this is a tiny, tiny piece of our overall codebase that does one very specific task.

The code for adding in metadata looks something like this in C#:

PdfDocumentMetadata meta = new PdfDocumentMetadata();
meta.Title = "Ethel the Frog Goes Quantity Surveying";
meta.Author = "Eric Idle";
meta.Creator = "Plinth"; // etc.
meta.Append(outputStream, false);

From the code, it appears that the licensing will work with dotImage Photo, which is our budget product. In addition, you can add custom PDF fields separate from the standard fields.

It is not an out-of-the-box solution: you would need to write code for this. But since you listed SharePoint as a possible repository, it should integrate well with it, since SharePoint is .NET friendly.
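
For a rough idea of what the "find PDFs that are missing metadata" half could look like without any commercial library, here is a minimal Python sketch. It is not dotImage and not a real PDF parser: it only pattern-matches the standard Title, Author, Subject, and Keywords entries when they are stored as plain literal strings in the file, so it will miss metadata that sits in compressed object streams, hex strings, or XMP packets. The function names (read_info_fields, missing_fields, scan_folder) are made up for illustration.

```python
import re
from pathlib import Path

# Standard text entries of the PDF document information dictionary.
FIELDS = ("Title", "Author", "Subject", "Keywords")

# Matches e.g. "/Title (Some text)". Handles backslash escapes, but NOT
# nested unescaped parentheses, hex strings, or compressed objects.
_FIELD_RE = re.compile(
    rb"/(Title|Author|Subject|Keywords)\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL
)


def read_info_fields(data: bytes) -> dict:
    """Return the fields found as plain literal strings (first match wins)."""
    found = {}
    for name, raw in _FIELD_RE.findall(data):
        # Simplification: drop the backslash from every escape sequence,
        # which is right for \( \) \\ but not for escapes such as \n.
        unescaped = re.sub(rb"\\(.)", rb"\1", raw, flags=re.DOTALL)
        text = unescaped.decode("latin-1").strip()
        if text:
            found.setdefault(name.decode("ascii"), text)
    return found


def missing_fields(data: bytes) -> list:
    """List the standard fields that could not be found in the raw bytes."""
    info = read_info_fields(data)
    return [field for field in FIELDS if field not in info]


def scan_folder(root: str) -> dict:
    """Map each PDF under root to the fields it appears to lack."""
    report = {}
    for path in sorted(Path(root).rglob("*.pdf")):
        gaps = missing_fields(path.read_bytes())
        if gaps:
            report[str(path)] = gaps
    return report
```

Calling scan_folder on the shared drive (for example scan_folder(r"S:\articles"), where the path is a placeholder) would give a dictionary of files and the fields they seem to lack, which could serve as the to-do list for a metadata clean-up pass. A file reported as "missing" may simply store its metadata in a form this sketch cannot see, so treat the result as a starting point.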
posted by plinth at 10:44 AM on May 18, 2009

Something like Alfresco would certainly work. But no matter what you do, getting the specific information into a DMS or something similar will take a lot of work, even if the setup were easy.
With limited IT support I would have suggested something like a desktop search engine, but you certainly will run into problems with several people indexing a shared network drive ...

I haven't tried it yet, but you might like OmniFind Yahoo! Edition. It seems to be something like an intranet document search engine. This version is free of charge and seems to be fairly easy to install. Check it out; maybe it works for your needs, at least as an interim solution until you implement something more specific to your needs.
posted by mmkhd at 1:22 PM on May 18, 2009

I need a thesaurus, I said "need" too often.
posted by mmkhd at 1:30 PM on May 18, 2009

I believe you can search multiple PDF files if you create an index. I don't know if this is only a full-text search or if it includes the metadata.
posted by momzilla at 12:44 PM on May 30, 2009
