Search, cross-link references in PDF articles?
June 20, 2010 8:44 PM

Perhaps this is just a fantasy, but is there any application or online tool that could search through the references of an article I have saved as a PDF, in order to check whether I have those cited articles in my larger PDF library? It would be perfect if it would highlight, link, or somehow display cross-referenced relationships between all my articles. I am already familiar with many reference and PDF organization programs, such as Papers (Mekentosj), Sente, and Devonthink. I suppose what I have in mind is a similar tool, but with the additional power of something like ISI Indexes. I have a fairly large library (about 300 references that I'm actively using, and more than 2000 total) and I'm just trying to get some "big-picture" grasp of how all these sources relate to one another.
posted by samac to Computers & Internet (5 answers total) 8 users marked this as a favorite
You may just have to brute-force it yourself. Maybe use mind-mapping software.
posted by stratastar at 11:45 PM on June 20, 2010

I've wondered this myself, and it seems that there is not. The main problem is getting details of which papers cite which others. There are proprietary databases (e.g. Web of Science), some free, domain-specific databases (e.g. CiteSeer), and some general databases (Google Scholar, but it is very incomplete), but there isn't a complete, freely accessible database of citations.

So, the program would have to extract citation data from the PDFs themselves. This is hard, because of the wide variety of different citation formats, incomplete citations (and for obscure journals that aren't indexed anywhere, disambiguating two incomplete citations may be impossible), multiple abbreviations for the same journal, differences in how names are spelt, citation errors (I've read papers in which the authors mis-cite their own previous work!), and abbreviations like op. cit. and ibid. For a more detailed discussion of the difficulties, search Google or CiteSeer for "citation extraction".
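
To make the difficulty concrete, here is a rough Python sketch of the naive approach: check whether the title of a paper you own appears inside a free-text citation. It assumes the reference list has already been split into one string per citation and that you have a list of titles for your library. The function names and the 0.9 threshold are illustrative assumptions, not taken from any existing tool.

```python
import re
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    s = re.sub(r"[^a-z0-9 ]+", " ", s.lower())
    return " ".join(s.split())

def best_match(citation, titles, threshold=0.9):
    """Return (title, score) for the library title that best fits the
    citation string, or None if no title reaches the threshold.

    The score is the length of the longest run of characters shared by
    the citation and the title, divided by the title length.
    """
    cit = normalize(citation)
    best = None
    for title in titles:
        t = normalize(title)
        if not t:
            continue
        matcher = SequenceMatcher(None, cit, t, autojunk=False)
        block = matcher.find_longest_match(0, len(cit), 0, len(t))
        score = block.size / len(t)
        if score >= threshold and (best is None or score > best[1]):
            best = (title, score)
    return best

if __name__ == "__main__":
    # Made-up titles and citation, purely for illustration.
    titles = ["Citation extraction from PDF documents",
              "Protein folding dynamics"]
    ref = "Smith, J. (2009). Citation Extraction from PDF Documents. J. Obscure Stud. 12:3-9."
    print(best_match(ref, titles))
```

This tolerates differences in case, punctuation, and the surrounding author and journal text, but a misspelt or abbreviated title, or a citation style that omits titles altogether, defeats it, which is exactly the problem described above.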

But if all the citations include DOIs, and you have tagged all your PDFs with their DOIs, the problem becomes much, much easier, and could be done fairly easily with a Perl script. Unique numerical identifiers are the future!
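
A minimal sketch of that DOI approach, written in Python rather than Perl. It assumes the text of each PDF has already been extracted to a string by some other tool, and that each paper in the library is keyed by its own DOI. The regular expression is a loose approximation of modern DOIs, and the `library` layout and function names are hypothetical.

```python
import re

# Loose pattern for modern DOIs: "10.", a 4-9 digit registrant code, "/",
# then a suffix. An approximation only; some older DOIs will not match.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def normalize_doi(doi):
    """Lowercase the DOI and strip trailing punctuation picked up from prose.

    Assumption: real DOIs rarely end in these characters, so stripping
    them is usually safe, but not always.
    """
    return doi.rstrip(".,;:)]}").lower()

def find_dois(text):
    """Return the set of normalized DOIs that appear in a block of text."""
    return {normalize_doi(m) for m in DOI_RE.findall(text)}

def cross_reference(library):
    """Map each paper to the papers it cites that are also in the library.

    `library` maps a paper's own DOI to its extracted full text.
    Returns {citing_doi: sorted list of cited DOIs found in the library}.
    """
    owned = {normalize_doi(d) for d in library}
    links = {}
    for doi, text in library.items():
        own = normalize_doi(doi)
        cited = find_dois(text) & owned
        cited.discard(own)  # a paper usually prints its own DOI too
        links[own] = sorted(cited)
    return links

if __name__ == "__main__":
    # Made-up DOIs and text, purely for illustration.
    library = {
        "10.1000/a1": "References: Smith (2009), doi:10.1000/B2. Jones, doi:10.9999/zzz.",
        "10.1000/b2": "This paper, doi:10.1000/b2, cites nothing else in the library.",
    }
    for citing, cited in cross_reference(library).items():
        print(citing, "->", cited)
```

The resulting mapping is a plain citation graph, so it could be fed to a graphing or mind-mapping tool to get the "big-picture" view.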
posted by James Scott-Brown at 2:16 AM on June 21, 2010

The biggest problem is access to the Web of Science and other Thomson Reuters database APIs. That was pretty much impossible last time I looked, but this seems to suggest that things are changing slightly:
posted by cromagnon at 4:46 AM on June 21, 2010

Just to continue the conversation, Mendeley DOES pull citations from within the PDFs themselves.
posted by stratastar at 4:00 PM on July 9, 2010

stratastar, as far as I can tell, Mendeley just extracts the bibliographic metadata for a PDF you import (as do other programs, like Papers); I don't think it extracts bibliographic metadata for papers mentioned/cited in a PDF.

I think samac wants a program to do the latter, which is much harder.
posted by James Scott-Brown at 8:08 AM on July 21, 2010
