Search, cross-link references in PDF articles?
June 20, 2010 8:44 PM
Perhaps this is just a fantasy, but is there any application or online tool that could search through the references of an article I have saved as a PDF, in order to check whether I have those cited articles in my larger PDF library? It would be perfect if it would highlight, link, or somehow display cross-referenced relationships between all my articles. I am already familiar with several reference and PDF organization tools such as Papers (Mekentosj), Sente, and Devonthink. I suppose what I have in mind is a similar tool, but with the additional power of something like the ISI indexes. I have a fairly large library (about 300 references that I'm actively using, and more than 2000 total) and I'm just trying to get some "big-picture" grasp of how all these sources relate to one another.
I've wondered this myself, and it seems that there is not. The main problem is getting details of which papers cite which others. There are proprietary databases (e.g. Web of Science), some free, domain-specific databases (e.g. CiteSeer), and some general databases (Google Scholar, but it is very incomplete), but there isn't a complete, freely accessible database of citations.
So, the program would have to extract citation data from the PDFs themselves. This is hard, because of the wide variety of different citation formats, incomplete citations (and for obscure journals that aren't indexed anywhere, disambiguating two incomplete citations may be impossible), multiple abbreviations for the same journal, differences in how names are spelt, citation errors (I've read papers in which the authors mis-cite their own previous work!), and abbreviations like op. cit. and ibid. For a more detailed discussion of the difficulties, search Google or CiteSeer for "citation extraction".
But if all the citations include DOIs, and you have tagged all your PDFs with their DOIs, the problem becomes much, much easier, and could be done fairly easily with a Perl script. Unique numerical identifiers are the future!
posted by James Scott-Brown at 2:16 AM on June 21, 2010
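The DOI approach described above can be sketched roughly as follows. This is a minimal illustration in Python rather than Perl, not anything a commenter posted. It assumes you have already extracted the reference-list text from each PDF (for example with a tool such as pdftotext) and that you know each paper's own DOI. The names `extract_dois` and `cross_reference`, the regular expression, and the example DOIs are all hypothetical, and the pattern is a loose heuristic that will miss unusual DOIs.

```python
import re

# Loose heuristic for DOIs: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption for illustration, not an official DOI grammar.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def extract_dois(text):
    """Return the set of DOIs found in text, lowercased, with trailing punctuation stripped."""
    return {match.rstrip(".,;:)]}").lower() for match in DOI_PATTERN.findall(text)}


def cross_reference(library):
    """Map each paper to the papers it cites that are also in the library.

    library: dict mapping a paper's own DOI to the plain text of its reference list.
    Returns a dict mapping each (lowercased) DOI to a sorted list of cited DOIs
    that are present in the library.
    """
    owned = {doi.lower() for doi in library}
    links = {}
    for doi, references_text in library.items():
        cited = extract_dois(references_text) & owned
        cited.discard(doi.lower())  # ignore a paper's mention of its own DOI
        links[doi.lower()] = sorted(cited)
    return links


if __name__ == "__main__":
    # Made-up DOIs and reference text, purely for demonstration.
    example_library = {
        "10.1000/aaa": "Smith J. (2008). doi:10.1000/bbb. Jones K. (2005). https://doi.org/10.1000/zzz.",
        "10.1000/bbb": "Lee M. (2001). DOI: 10.1000/AAA; and nothing else.",
        "10.1000/ccc": "No DOIs in this reference list.",
    }
    for citing, cited in cross_reference(example_library).items():
        print(citing, "->", cited)
```

References without DOIs are simply invisible to this sketch, which is exactly the hard case discussed above.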
The biggest problem is access to the Web of Science and other Thomson Reuters database APIs. Getting access was pretty much impossible the last time I looked, but this post suggests that things are changing slightly:
http://bibwild.wordpress.com/2009/04/13/cited-by-from-isi-and-scopus-in-the-link-resolver/
posted by cromagnon at 4:46 AM on June 21, 2010
Just to continue the conversation, Mendeley DOES pull citations from within the PDFs themselves.
posted by stratastar at 4:00 PM on July 9, 2010
stratastar, as far as I can tell, Mendeley just extracts the bibliographic metadata for a PDF you import (as do other programs, like Papers); I don't think it extracts bibliographic metadata for the papers cited within that PDF.
I think samac wants a program to do the latter, which is much harder.
posted by James Scott-Brown at 8:08 AM on July 21, 2010
This thread is closed to new comments.
posted by stratastar at 11:45 PM on June 20, 2010