Comments on: Search, cross-link references in PDF articles?

Question: Search, cross-link references in PDF articles?

samac — Sun, 20 Jun 2010 20:44:01 -0800

Perhaps this is just a fantasy, but is there any application or online tool that could search through the references of an article I have saved as a PDF, in order to check whether I have those cited articles in my larger PDF library? It would be perfect if it would highlight, link, or somehow display cross-referenced relationships between all my articles. I am already familiar with several reference-management/PDF organization programs such as Papers (Mekentosj), Sente, and DEVONthink. I suppose what I have in mind is a similar tool, but with the additional power of something like the ISI indexes. I have a fairly large library (about 300 references that I'm actively using, and more than 2000 total) and I'm just trying to get some "big-picture" grasp of how all these sources relate to one another.

By: stratastar

stratastar — Sun, 20 Jun 2010 23:45:39 -0800

You may just have to brute-force it yourself. Maybe use mind-mapping software.

By: James Scott-Brown

James Scott-Brown — Mon, 21 Jun 2010 02:16:59 -0800

I've wondered this myself, and it seems that there is not. The main problem is getting details of which papers cite which others. There are proprietary databases (e.g. Web of Science), some free, domain-specific databases (e.g. CiteSeer), and some general databases (Google Scholar, but it is very incomplete), but there isn't a complete, freely accessible database of citations.

So, the program would have to extract citation data from the PDFs themselves. This is hard, because of the wide variety of different citation formats, incomplete citations (and for obscure journals that aren't indexed anywhere, disambiguating two incomplete citations may be impossible), multiple abbreviations for the same journal, differences in how names are spelt, citation errors (I've read papers in which the authors mis-cite their own previous work!), and abbreviations like op. cit. and ibid. For a more detailed discussion of the difficulties, search Google or CiteSeer for "citation extraction".

But if all the citations include DOIs, and you have tagged all your PDFs with their DOIs, the problem becomes much, much easier, and could be done fairly easily with a Perl script. Unique numerical identifiers are the future!
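To make the DOI idea concrete, here is a minimal sketch in Python rather than Perl. It assumes you have already extracted each PDF's reference section as plain text (with whatever tool you like) and that you keep a mapping from DOI to filename for your library. The names `find_dois`, `cross_reference`, `library`, and `reference_texts` are made up for this illustration, and the DOI pattern is a loose heuristic, not the official DOI syntax, so expect some misses and false matches.

```python
import re

# Loose heuristic for modern DOIs: "10.", a 4-9 digit registrant code, "/", a suffix.
# This is an assumption for illustration, not the full DOI syntax.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return the set of DOI-like strings in text, lowercased,
    with common trailing punctuation stripped."""
    return {m.group(0).rstrip(".,;)]").lower() for m in DOI_PATTERN.finditer(text)}


def cross_reference(library, reference_texts):
    """Work out which papers in the library cite which others.

    library:         dict mapping DOI -> filename of the PDF you own
    reference_texts: dict mapping filename -> plain text of its reference list
    Returns a dict mapping filename -> sorted list of library filenames it cites.
    """
    by_doi = {doi.lower(): name for doi, name in library.items()}
    links = {}
    for name, text in reference_texts.items():
        cited = {by_doi[d] for d in find_dois(text) if d in by_doi}
        cited.discard(name)  # ignore a paper's own DOI if it appears in its text
        links[name] = sorted(cited)
    return links


if __name__ == "__main__":
    # Made-up example data
    library = {
        "10.1000/abc123": "smith2005.pdf",
        "10.1000/xyz789": "jones2008.pdf",
    }
    reference_texts = {
        "jones2008.pdf": "1. Smith J. (2005) A title. doi:10.1000/ABC123.\n"
                         "2. Someone else. doi:10.9999/notmine",
        "smith2005.pdf": "References without any DOIs.",
    }
    for paper, cited in cross_reference(library, reference_texts).items():
        print(paper, "cites", cited)
```

The resulting dictionary is effectively a citation graph of your own library, which could then be fed to a graphing or mind-mapping tool. It only works for citations that actually print a DOI, which is the big caveat above.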

By: cromagnon

cromagnon — Mon, 21 Jun 2010 04:46:36 -0800

The biggest problem is access to the Web of Science/other Thomson Reuters database APIs. This was pretty much impossible the last time I looked, but this post seems to suggest that things are changing slightly:

http://bibwild.wordpress.com/2009/04/13/cited-by-from-isi-and-scopus-in-the-link-resolver/

By: stratastar

stratastar — Fri, 09 Jul 2010 16:00:31 -0800

Just to continue the conversation, Mendeley DOES pull citations from within the PDFs themselves.

By: James Scott-Brown

James Scott-Brown — Wed, 21 Jul 2010 08:08:02 -0800

stratastar, as far as I can tell, Mendeley just extracts the bibliographic metadata for a PDF you import (as do other programs, like Papers); I don't think it extracts bibliographic metadata for papers mentioned/cited in a PDF.

I think samac wants a program to do the latter, which is much harder.