Can the singularity help
November 10, 2011 2:00 PM

TediousTaskHelpFilter: For a project I'm working on, I manually went through 4000+ article abstracts from a literature database and classified each article as either a target article or an irrelevant one (with a broad classification of why it was irrelevant). My advisor has indicated that it's standard practice for me to run this same search on a second database, which will overwhelmingly return repeats from my first search. As this is nowhere near the only thing on my plate, I'd like to optimize this task. Is there any computer-based solution for eliminating the repeated hits between these two databases?

For what it's worth, the two databases in question are PubMed/Medline (what I used originally) and PsycINFO. I believe I can manage to get text dumps of each pool of results. It'd be really sweet if I could pare the PsycINFO result list down so it excludes everything already on the PubMed list. I do know some computer science-y folks who might be able to help me out, but if the implementation is simple I could conceivably do it myself. Any ideas would be great. Thanks!
posted by Keter to Computers & Internet (5 answers total)
Someone else might be able to chime in with a method, or with whether this is possible at all, but could you import everything into reference-management software like EndNote (which will offer benefits for your work) and then get it into a database from there?
posted by biffa at 2:32 PM on November 10, 2011


If you can get them into the same format, you could diff your lists. For plain text, you can just use diff (presuming you have access to Unix in some flavor, e.g. a Mac).

For spreadsheets:
http://stackoverflow.com/questions/114698/how-do-i-diff-two-spreadsheets
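
If you'd rather not wrestle with diff output, the same set-difference idea is only a few lines of Python. A rough sketch, where the file names and the one-record-per-line format are just assumptions about what your text dumps might look like:

```python
def load_records(path):
    """Read one record per line, normalizing case and whitespace."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

pubmed = load_records("pubmed_results.txt")      # hypothetical export
psycinfo = load_records("psycinfo_results.txt")  # hypothetical export

# Records in the PsycINFO dump that never appeared in the PubMed dump.
for record in sorted(psycinfo - pubmed):
    print(record)
```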
posted by Nahum Tate at 2:37 PM on November 10, 2011


EndNote will definitely allow you to import articles from both databases and then remove duplicates. (You can also add your own field to hold your target/irrelevant classification.)

Your university library might have a site license for EndNote or RefWorks (which I assume can do something similar).
posted by kbuxton at 2:50 PM on November 10, 2011


I'm not sure what tools to use, but if you download the data from those databases with records that include the DOI for each article, the DOI gives you a unique ID for matching up and removing duplicates. That will be more reliable than plain text matching (I'm not sure how the reference managers do their de-duping, but probably something similar).
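
To make that concrete, here's a rough sketch of DOI-based filtering in Python, assuming both databases can export CSV with a DOI column (the file names and the "DOI" column name are illustrative, not necessarily what the exports actually use):

```python
import csv

def read_rows(path):
    """Load a CSV export into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Collect every DOI already seen in the PubMed export.
pubmed_dois = {
    row["DOI"].strip().lower()
    for row in read_rows("pubmed.csv")
    if row.get("DOI")
}

# Write out only the PsycINFO rows whose DOI wasn't in the PubMed set.
with open("psycinfo.csv", newline="", encoding="utf-8") as f_in, \
     open("psycinfo_new_only.csv", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        doi = (row.get("DOI") or "").strip().lower()
        # Keep DOI-less rows for manual review rather than silently dropping them.
        if not doi or doi not in pubmed_dois:
            writer.writerow(row)
```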
posted by marylynn at 3:14 PM on November 10, 2011


Careful with EndNote: if certain fields aren't perfect matches, it won't see the duplication. I think marylynn's right, though; if you get the DOI, it serves as the standard unique ID and overrides the other fields for duplicate detection.

In a pinch, you can do the same thing by hand: save your two searches as flatfiles (using whatever delimiter works for you and the reference database; you'd think this would be the same for all PsycINFO instances, but sadly it varies depending on which service your library subscribes to), import them into your data-management application of choice, and merge them 1:1 using the DOI as the unique ID. Ideally the author surname fields would be identical across the two literature databases, so you could use them as a checkpoint, but I have not found this to be uniformly the case.
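
A rough sketch of that merge-and-checkpoint step in Python, with the delimiter and the "DOI"/"FirstAuthor" column names assumed (they'll depend on how your library's provider formats the export):

```python
import csv

def index_by_doi(path, delimiter="\t"):
    """Index a delimited export by lowercased DOI; rows without a DOI are skipped."""
    index = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            doi = (row.get("DOI") or "").strip().lower()
            if doi:
                index[doi] = row
    return index

pubmed = index_by_doi("pubmed_export.txt")      # hypothetical file names
psycinfo = index_by_doi("psycinfo_export.txt")

# For every article matched 1:1 on DOI, cross-check the first-author surname.
for doi in pubmed.keys() & psycinfo.keys():
    a = (pubmed[doi].get("FirstAuthor") or "").strip().lower()
    b = (psycinfo[doi].get("FirstAuthor") or "").strip().lower()
    if a != b:
        # Flag mismatches for hand-checking instead of trusting the match blindly.
        print(f"Check {doi}: PubMed says {a!r}, PsycINFO says {b!r}")
```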
posted by gingerest at 5:39 PM on November 10, 2011

