Cross citation analysis - What cites what across 550 pdfs
October 31, 2019 10:16 PM

I'd like to find which papers cite which other papers, and get a list showing what cites what. Are there any papers not cited by any of the others? I'm doing this because I want to filter out the papers that have less value; I feel that about fifty are really, really useful and the others less so. I would prefer a desktop solution, and a free one.

All file names differ from the PDF's internal title, e.g. abacusReading Epigenetic2014_Herman_etal_Evolution.pdf has an internal name of: How stable should epigenetic modifications be? Insights from adaptive plasticity and bet hedging.

There are approx. 550 papers, all text-based PDFs; most of them include figures and photos, if that matters at all. Some papers are not findable on the web. If it matters, the topic is agricultural hydrology and solving nutrient pollution.

All papers are in one folder on my local system. The system is Windows 10, 64-bit.

Solutions I've found produce graphical outputs (and/or require a lot of data cleaning), and a list is all I really need; for something like this I feel graphics obscure the information. Things I've looked at:

RCitation [https://www.researchgate.net/publication/327790285_R_script_for_creating_a_cross-citation_network], which is an R script for creating a cross-citation network, but it requires converting all files to txt first and a huge amount of prework, and it still sounds like it will miss many citations.

Gephi looks like a lot of pre-work.

https://www.vosviewer.com - Seems oriented towards online sources only
posted by unearthed to Computers & Internet (3 answers total)
 
It would probably be easier to just get the identity of each paper (title, authors & journal, or DOI for those that have it), and then look those papers up in one or more of the existing online services that track citations. They have already done the tedious work of getting the citations out of the papers and cross-referencing them.
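As a rough illustration of the "get the identity of each paper" step, here is a minimal Python sketch that pulls DOI-like strings out of text that has already been extracted from a PDF. The function names (`find_dois`, `own_doi`), the regular expression, and the 3000-character cutoff are all assumptions made for this example, not part of any citation service; the pattern is a heuristic and will miss or mangle some DOIs.

```python
import re

# Heuristic for the common "10.<registrant>/<suffix>" DOI shape.
# This is an assumption for illustration, not a full DOI validator.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return the distinct DOI-like strings in text, in order of first appearance."""
    seen = []
    for match in DOI_PATTERN.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;:)]").lower()
        if doi not in seen:
            seen.append(doi)
    return seen


def own_doi(text, first_chars=3000):
    """Guess a paper's own DOI: the first DOI-like string near the start, if any."""
    dois = find_dois(text[:first_chars])
    return dois[0] if dois else None


if __name__ == "__main__":
    # The DOIs below are made up for the demonstration.
    sample = "J. Example 12: 1-9. doi:10.1234/example.001. See https://doi.org/10.5678/abc-2, too."
    print(find_dois(sample))
```

Papers with no DOI in their text would still need to be identified by title, authors and journal, as described above.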

You will almost certainly have some work left over to do to fill in the gaps for papers not found online, old scanned papers that don't OCR well, etc. But that is going to be the case for any solution I'm afraid.

However, I would strongly caution against using citation counts as a measure of usefulness in the first place. There is lots of great work out there that has never been cited, and lots of crap work that is cited repeatedly. The processes driving this often have more to do with politics and power than scholarship.
posted by automatronic at 3:49 AM on November 1, 2019 [6 favorites]


Response by poster: Thanks. Just to clarify, I'm not using citation counts at all; I'm only interested in the References / Literature / Further Reading (etc.) section of each PDF in my collection.

I have already isolated and removed "old scanned papers that don't OCR well, etc"
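For what it's worth, the kind of list described here (what cites what within the collection, and which papers nothing else cites) could be sketched in Python roughly as follows. This assumes the text of each PDF has already been extracted by some other tool and is held in a dict keyed by the paper's internal title; the heading list, the function names, and the exact-title matching are all assumptions for illustration. Matching on full titles will miss references that abbreviate or misspell a title, so the output would need spot-checking.

```python
import re

# Headings that commonly open a reference list (an assumed, incomplete set).
REF_HEADINGS = re.compile(
    r"^[ \t]*(references|literature cited|literature|bibliography|further reading)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def normalize(text):
    """Lowercase and reduce every run of non-alphanumerics to a single space."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def reference_section(text):
    """Return the text after the last reference-style heading, or all of it if none is found."""
    matches = list(REF_HEADINGS.finditer(text))
    return text[matches[-1].end():] if matches else text


def cross_citations(papers):
    """papers maps each paper's title to its extracted full text.

    Returns (cites, uncited): cites maps a title to the sorted titles it cites
    within the collection; uncited lists the titles that no other paper cites.
    """
    norm_titles = {title: normalize(title) for title in papers}
    cites = {}
    cited = set()
    for title, text in papers.items():
        # Pad with spaces so a title only matches on whole-word boundaries.
        refs = " " + normalize(reference_section(text)) + " "
        hits = sorted(
            other
            for other, norm in norm_titles.items()
            if other != title and norm and " " + norm + " " in refs
        )
        cites[title] = hits
        cited.update(hits)
    uncited = sorted(title for title in papers if title not in cited)
    return cites, uncited
```

Calling `cross_citations(papers)` and printing each title followed by its list gives a plain-text "what cites what" listing with no graphics involved.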
posted by unearthed at 2:32 PM on November 1, 2019


Response by poster: I will ask some further questions in a new post as I have thought more about this both from writing the questions and pondering automatronic's answer.
posted by unearthed at 12:36 PM on December 1, 2019

