Freely Accessible Etymology Database? Or tools to help create one?
October 30, 2014 2:26 PM

I have an idea for a project that would require the ability to search a dictionary of words and find the year of each word's known introduction (as close as possible). I am aware of etymology-online (love that site), but as far as I'm aware it's just a site, and the compilers don't have a publicly accessible database. Does anybody know of a site that actually WOULD have a freely available database (either queryable through a web API, or downloadable to self-host)?

If there isn't one, does anybody have an idea who I might be able to contact? Would it be prudent to approach the etymology online folks?

I feel like scraping their pages for the data would be reckless and kind of jerky in terms of bandwidth, but then again - it's mostly text and these days bandwidth is fairly cheap... But I suppose it'd be possible to do that as a last resort and script something to pull the data into a database? Any ideas on what would be good tools to do such a thing?
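
For the "pull the data into a database" part, a minimal Python sketch might look like the following. It assumes entries carry dates written like "1590s", "c. 1300", or "late 14c."; that date format, the `approximate_year` and `store` helper names, and the table layout are all assumptions made for illustration, not anything the site documents.

```python
import re
import sqlite3

# Assumed date styles: "1590s", "c. 1300" (a 3-4 digit year) or "late 14c." (a century).
YEAR_RE = re.compile(r"\b(\d{3,4})s?\b")
CENTURY_RE = re.compile(r"\b(\d{1,2})c\.")


def approximate_year(entry_text):
    """Return a rough year of introduction found in an entry, or None."""
    year = YEAR_RE.search(entry_text)
    century = CENTURY_RE.search(entry_text)
    # Prefer whichever kind of date appears first in the entry text.
    if year and (not century or year.start() < century.start()):
        return int(year.group(1))
    if century:
        # "14c." means the 1300s; use the start of the century as a rough guess.
        return (int(century.group(1)) - 1) * 100
    return None


def store(conn, word, entry_text):
    """Save a word, its rough year, and the raw entry text (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(word TEXT PRIMARY KEY, year INTEGER, entry TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO words VALUES (?, ?, ?)",
        (word, approximate_year(entry_text), entry_text),
    )
    conn.commit()
```

Keeping the raw entry text next to the guessed year makes it possible to re-run a better date parser later without fetching anything again.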
posted by symbioid to Technology (6 answers total) 3 users marked this as a favorite
 
Ask before scraping.

And I recommend Python for scraping. You could also ask the open data communities on Reddit and Stack Exchange :)
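
If the site owner does say yes, a polite scraper could look roughly like this sketch, which uses only the Python standard library. It checks robots.txt rules before each request and pauses between fetches; the `USER_AGENT` string, the `allowed` and `fetch_politely` names, and the five-second delay are placeholders I made up, not values anyone has asked for.

```python
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "etymology-hobby-bot"  # hypothetical name; identify yourself honestly


def allowed(robots_lines, url, agent=USER_AGENT):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


def fetch_politely(urls, robots_lines, delay_seconds=5.0):
    """Yield (url, html) pairs, skipping disallowed URLs and pausing between requests."""
    for url in urls:
        if not allowed(robots_lines, url):
            continue
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            html = response.read().decode("utf-8", errors="replace")
        yield url, html
        # Wait before the next request so the server is never hammered.
        time.sleep(delay_seconds)
```

The robots.txt lines are passed in rather than downloaded inside the function, so the rule check can be tried offline before any request is sent.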


Regards, and good luck.
posted by bussiere at 2:39 PM on October 30, 2014 [1 favorite]


This is probably what you want.

(previously)
posted by aubilenon at 3:03 PM on October 30, 2014


Oh, actually that doesn't have years, so it might NOT be what you want. But it's still awesome.
posted by aubilenon at 3:10 PM on October 30, 2014 [1 favorite]


The Wordnik API (documentation/test interface) has etymologies, though they don't (as far as I can tell) generally have an easily extractable year of introduction. For relatively recent words, you could use Google Ngram data to find the first year a given word appeared in published books.
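
As a rough sketch of the Ngram idea in Python: the code below assumes the downloadable 1-gram files are tab-separated rows of ngram, year, match count, and volume count, which is worth checking against the dataset's own description before relying on it. The `first_year_seen` name and the `min_volumes` threshold are my own inventions; the threshold is there because a single stray early hit can be a misdated volume.

```python
import csv
import io


def first_year_seen(rows, word, min_volumes=3):
    """Return the earliest year `word` appears in at least `min_volumes` books."""
    years = [
        int(year)
        for ngram, year, match_count, volume_count in rows
        if ngram == word and int(volume_count) >= min_volumes
    ]
    return min(years) if years else None


# Made-up sample rows in the assumed format, only to show the call.
sample = "blog\t1805\t1\t1\nblog\t1999\t40\t12\nblog\t2001\t900\t300\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))
print(first_year_seen(rows, "blog"))
```

For real files, the same function could be fed from `csv.reader` over an opened file instead of the in-memory sample.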
posted by aparrish at 3:26 PM on October 30, 2014 [1 favorite]


> For relatively recent words, you could use Google Ngram data to find the first year a given word appeared in published books.

Bad idea; Google's metadata is notoriously unreliable.
posted by languagehat at 5:38 PM on October 30, 2014


Response by poster: I think my best bet is to ask the etymonline guy if he has data he's willing to share or that I could pay for, or, if not, whether he minds if I scrape his pages. Looks like most etymology stuff doesn't really have dates. And it doesn't have to be super accurate, just close enough. I wonder how he ended up getting dates; perhaps he used Google's ngram stuff and it's just as unreliable?
posted by symbioid at 7:32 PM on October 30, 2014

