

1-s2.0-S0958166910000443-main dot PDF
September 11, 2012 6:14 AM

Does anyone know why academic PDF documents downloaded as full text seem to universally download as poorly documented/difficult to sort files?

Weird titles (like "fulltext", or "8" with no metadata [or metadata, but in the wrong places], no author info/name, etc.). It makes sorting large pools of papers much more difficult and time-intensive, and it seems to introduce "variability", which seems like a bad thing: it makes disparate researchers less likely to find and cite someone's specialized research. Anecdotally, this seems inversely related to the size of the publishing company sharing the papers (small outlets seem to have better documentation, while the largest have the worst).

Am I doing something wrong [I didn't even know to spread the rim of my ketchup baskets until this month]? Is this to prevent research by web-scraping? Is this concern for nothing? Can this [or I] be fixed? Do the large publishers not know any librarians or archivists? Is this a recognized issue?
posted by infinite intimation to Education (14 answers total) 4 users marked this as a favorite
 
Am I doing something wrong?

Can this [or I] be fixed?



I just rename mine to AuthorLastName/Names_TitleOfPaper.pdf.
posted by semaphore at 6:19 AM on September 11, 2012 [2 favorites]
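
For anyone who would rather script that renaming than do it by hand, here is a minimal sketch in Python. The helper name `paper_filename`, the choice to strip everything except letters, digits, and hyphens, and the 60-character title cap are all assumptions made for illustration; the comment above only gives the AuthorLastName/Names_TitleOfPaper.pdf pattern.

```python
import re

def paper_filename(last_names, title, max_title_len=60):
    """Build a name like AuthorLastNames_TitleOfPaper.pdf (hypothetical helper)."""
    # Concatenate the author last names, keeping only letters, digits, hyphens.
    authors = "".join(re.sub(r"[^A-Za-z0-9-]", "", name) for name in last_names)
    # Split the title into alphanumeric words and capitalize the first letter
    # of each, which drops colons, spaces, and other awkward characters.
    words = re.findall(r"[A-Za-z0-9]+", title)
    camel = "".join(word[:1].upper() + word[1:] for word in words)
    # Cap the title part so the name stays manageable (assumed limit).
    return f"{authors}_{camel[:max_title_len]}.pdf"

print(paper_filename(["Scott", "Davey"],
                     "Biodiesel from algae: challenges and prospects"))
```

The example title is the paper cited elsewhere in this thread; the author list is shortened to two names just to keep the output readable.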


Many people also use software like EndNote, which solves a lot of the issues that you are having.
posted by semaphore at 6:21 AM on September 11, 2012


I think the problem you're looking at is one of PDF organization, and having an "informative" filename only solves that problem in a very naive and ultimately not very useful way. Don't think of the title and author list as metadata of the PDF, but rather think of the PDF itself as a metadatum of a journal citation.

If you use something like EndNote, you can attach the PDF you've downloaded to a citation entry, which will have author names, title, abstract, pub date, and keywords. That way, when you're in year 6 of your PhD or whatever, and you're looking for that paper you read during comps, you can do a thorough search in EndNote and then pull up the PDF file. Renaming something Names_TitleOfPaper.pdf is a good solution for very small collections of papers, but becomes useless later on (trust me, this is from experience).
posted by reformedjerk at 6:24 AM on September 11, 2012 [2 favorites]


At least some of the problem can be attributed to companies using the same PDFs for different purposes. Some low-level schmuck at a publishing company produces the PDF for distribution to the pressmen and then that same PDF is sent to the blind, to the author, to the internal archive, or to wherever, and they all have the same name, or they don't, and nobody seems much to care.

The large publisher I worked for may have had a librarian or archivist but they had no role in spreading any kind of company-wide naming convention for files. I have never worked at any company in any industry where there was a naming convention that was known and adopted by all the company's employees, or even a majority of them, even in data-centric industries. And just try putting one in place! Most people have never even considered such a thing as a file-naming policy. And consistency is hard to teach. "Why do I need to indicate the date in the file name? It's already in the directory listing. I don't understand why it has to be author-date-keyword-version. Who cares about the order? And why can't I separate the words with spaces or underscores instead?"

All in all, I'd say file-naming shows a selfishness and lack of foresight similar to car-driving. Incredibly smart people only consider where they're going and what they themselves need, to the sacrifice and detriment of everyone else and their needs.
posted by Mo Nickels at 6:30 AM on September 11, 2012


Mendeley is also a really useful solution to this problem, but again requires you to do a bit of work on your end. Mostly, it's a matter of getting into the habit of using it for every PDF you look at, which takes a while. You can also use it to manage citations of books. The reasons to choose it over EndNote are primarily that it's free and has interesting social networking components that (as far as I know) EndNote does not.

(spread the rims on ketchup baskets? I am learning this right now.)
posted by dizziest at 6:42 AM on September 11, 2012 [1 favorite]


No, it's not you. This is how it is. Some publishers do have their own standards (e.g. the PDF title will be the DOI or some well-defined subsection thereof) but nobody's even tried to hammer out a shared standard. ScienceDirect, charmingly, used to name every download "science.pdf" -- I am downloading pure science! I think they do something slightly saner now. And as you've noticed, in-PDF metadata is thin on the ground.

Anyway, as to actually dealing with this situation, here's what I do. When downloading a paper, I also download the citation (there's almost always an option to grab it in RIS, BibTeX, RefWorks, or whatever). Then, at once, I rename the paper according to my own naming scheme, and add the citation to my reference management software -- including a link to the filename, and a note on why I downloaded the paper and how I found it (very useful if I'm not reading it right away). All my papers live in the same folder, and the reference manager knows where to find them.
posted by pont at 6:49 AM on September 11, 2012 [1 favorite]
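
The workflow in the comment above (grab the citation, then record it together with the file location and a note) can be sketched roughly as follows in Python. This assumes the citation was downloaded in RIS form and handles only the simplest case: one record, one value per line, and the common tags `AU`, `TI`/`T1`, and `PY`. The function names, the entry layout, and the example path and note are hypothetical; a real reference manager does far more.

```python
import re

# An RIS line is a two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  -\s?(.*)$")

def parse_ris(text):
    """Parse a single RIS record into {tag: [values]} (simplified sketch)."""
    record = {}
    for line in text.splitlines():
        match = RIS_LINE.match(line)
        if not match:
            continue  # ignore blank lines and anything that is not a tag line
        tag, value = match.groups()
        if tag == "ER":  # end-of-record marker
            break
        record.setdefault(tag, []).append(value.strip())
    return record

def library_entry(ris_text, pdf_path, note):
    """Combine citation fields with the PDF location and a reading note."""
    record = parse_ris(ris_text)
    return {
        "authors": record.get("AU", []),
        "title": (record.get("TI") or record.get("T1") or [""])[0],
        "year": (record.get("PY") or [""])[0][:4],
        "file": pdf_path,  # where the renamed PDF lives
        "note": note,      # why it was downloaded and how it was found
    }

SAMPLE = """TY  - JOUR
AU  - Scott, S. A.
AU  - Davey, M. P.
TI  - Biodiesel from algae: challenges and prospects
PY  - 2010
ER  - 
"""

entry = library_entry(SAMPLE, "papers/Scott_2010_biodiesel.pdf",
                      "Found via a review article; read before Friday.")
print(entry["title"], entry["year"], entry["file"])
```

Keeping the file path inside the citation entry, rather than relying on the filename alone, is the point of the approach: the search happens over the citation fields and the path is just a pointer.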


It may be that since there are other ways to acquire the appropriate metadata and manipulate citations and their associated full-text items, naming files is not worth it. At my library, we encourage people to use a citation manager.

Here's what I did with the file name that you mentioned. I googled it and brought up this page. I have the citation manager Zotero installed in Firefox (it's free); it was able to pull all the article's bibliographic information and -- had I access to the article -- I could have attached that PDF to the bib entry. Using Zotero I can add notes and tags to the entry, and create bibliographies using the stored metadata in most major styles. For example, this is the output for the aforementioned entry in APA 6th edition:
Scott, S. A., Davey, M. P., Dennis, J. S., Horst, I., Howe, C. J., Lea-Smith, D. J., & Smith, A. G. (2010). Biodiesel from algae: challenges and prospects. Current Opinion in Biotechnology, 21(3), 277–286. doi:10.1016/j.copbio.2010.03.005

If you are working on something requiring more than just a handful of sources, you should use a citation manager.
posted by cog_nate at 6:53 AM on September 11, 2012 [1 favorite]


I also use Zotero, and one handy feature it has is automatic file renaming (I think it goes Author Year Title.pdf.) This is massively useful for any time you have to email or send papers to other people - it's just friendly and eases communication, even if they will be downloading them into their own reference managers....
posted by heyforfour at 7:12 AM on September 11, 2012 [1 favorite]


Every single pdf I download is named `fulltext` something. I use Mendeley too and for the most part it will read, rename the file with metadata scraped from the document, and move it to a permanent location outside my download folder.
In case it is not able to parse the content, it will ask me to review the record and manually enter the metadata (doesn't happen all that often).

Mendeley is free - try it out.
posted by special-k at 7:25 AM on September 11, 2012
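
How Mendeley actually scrapes metadata is not described in this thread, so the following is not its method. One plausible ingredient of any such tool is spotting a DOI in the document's text, which can then be used to look up the full citation. Below is a minimal Python sketch of that one step, assuming the text has already been extracted from the PDF by some other means. The pattern is a commonly used approximation of DOI syntax, not a complete grammar, and the name `find_doi` is made up for this example.

```python
import re

# Approximate DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Heuristic: trailing punctuation usually belongs to the sentence,
    # not to the DOI, so strip it.
    return match.group(0).rstrip(".,;)")

print(find_doi("Current Opinion in Biotechnology, 21(3), 277-286. "
               "doi:10.1016/j.copbio.2010.03.005"))
```

When no DOI is found the function returns None, which corresponds to the case described above where the record has to be reviewed and filled in by hand.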


Am I doing something wrong?

No.

Is this to prevent research by web-scraping?

No.

Can this [or I] be fixed?

As others have mentioned, citation management software can auto-rename. I use Zotero.

Do the large publishers not know any librarians or archivists?

My money's on them just being technologically incompetent and/or not bothered enough to do anything about it.

Is this a recognized issue?

Oh hell yes.
posted by turkeyphant at 11:20 AM on September 11, 2012


As others have noted, this is a problem way bigger than filenames--good articles come with bookmarks to help the reader skip from the abstract to the methods discussion easily, and have proper hypertext and linking, etc.

I have a hunch that it is more common in e-versions of journals that are of print-origin than with digital native journals. Primarily, it speaks to the fact that the publishers do not value e-content highly. It is a recognized issue, at least among librarians. Some journals have done an excellent job of including metadata, but most are behind the times.
posted by epanalepsis at 12:44 PM on September 11, 2012


There is no *accepted* standard for journal article pdf metadata (although it's not exactly a hard problem, and there are lots of candidates). This article explains the inelegant way that JabRef (a free citation manager) has had to resort to.

In fact, everything about citation management is dysfunctional.
posted by cromagnon at 4:29 PM on September 11, 2012


I work in administration for a large university system, which informs my perspective: most people who generate PDFs (any kind of electronic document, really) are borderline incompetent.

At times I threaten to confiscate everybody's computers and issue them crayons and kindergarten scissors instead. "You cannot be trusted with formatting!" *death glare* "You are not allowed to use Styles! And you!" *fingers letter-opener threateningly* "If you so much as think about clicking 'View Slide Master' I will cut you."

Ahem.
posted by Lexica at 6:45 PM on September 11, 2012


Oh, and I still haven't been able to figure out why the PDFs we get from our General Counsel's office all have random letters in the filenames replaced with underscores. Bizarre.
posted by Lexica at 6:46 PM on September 11, 2012

