Freely Accessible Etymology Database? Or tools to help create one?
October 30, 2014 2:26 PM Subscribe
I have an idea for a project that would require the ability to search a dictionary of words and find the year of each word's known introduction (as close as possible).
I am aware of etymology-online (love that site), but as far as I'm aware it's just a site, and the compilers don't have a publicly accessible database. Does anybody know of a site that actually WOULD have a freely available database (either queryable through a web API, or downloadable to self-host)?
If there isn't one, does anybody have an idea who I might be able to contact? Would it be prudent to approach the etymology online folks?
I feel like scraping their pages for the data would be reckless and kind of jerky in terms of bandwidth, but then again - it's mostly text and these days bandwidth is fairly cheap... But I suppose it'd be possible to do that as a last resort and script something to pull the data into a database? Any ideas on what would be good tools to do such a thing?
Oh, actually that doesn't have years, so it might NOT be what you want. But it's still awesome.
posted by aubilenon at 3:10 PM on October 30, 2014 [1 favorite]
The Wordnik API (documentation/test interface) has etymologies, though they don't (as far as I can tell) generally have an easily extractable year of introduction. For relatively recent words, you could use Google Ngram data to find the first year a given word appeared in published books.
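A rough sketch of the Ngram idea, in case it helps. It assumes the downloadable 1-gram files are tab-separated lines of `ngram`, `year`, `match_count`, `volume_count` (check the dataset's own description before relying on that layout). The function name `earliest_year`, the `min_volumes` threshold, and the sample rows are all made up for illustration; the threshold is only there so that a single stray hit can be skipped.

```python
def earliest_year(lines, word, min_volumes=1):
    """Return the earliest year `word` appears in at least `min_volumes` volumes.

    `lines` is any iterable of tab-separated rows in the assumed layout:
    ngram, year, match_count, volume_count. Matching is case-sensitive.
    Returns None if the word never meets the threshold.
    """
    best = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that don't fit the assumed layout
        ngram, year, _match_count, volume_count = fields
        if ngram != word or int(volume_count) < min_volumes:
            continue
        year = int(year)
        if best is None or year < best:
            best = year
    return best


# Invented sample rows, not real Ngram data.
SAMPLE_ROWS = [
    "blorp\t1804\t1\t1",
    "blorp\t1951\t12\t9",
    "blorp\t1952\t20\t15",
    "quux\t1970\t3\t2",
]

if __name__ == "__main__":
    print(earliest_year(SAMPLE_ROWS, "blorp"))
    print(earliest_year(SAMPLE_ROWS, "blorp", min_volumes=5))
```

Raising `min_volumes` trades recall for robustness: a lone early hit is ignored, at the cost of dating a genuinely rare word later than it should be.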
posted by aparrish at 3:26 PM on October 30, 2014 [1 favorite]
> For relatively recent words, you could use Google Ngram data to find the first year a given word appeared in published books.
Bad idea; Google's metadata is notoriously unreliable.
posted by languagehat at 5:38 PM on October 30, 2014
Response by poster: I think my best bet is to ask the etymonline guy if he has data he's willing to share or I could pay for, or if not, if he minds if I scrape his page. Looks like most etymology stuff doesn't really have dates. And it doesn't have to be super accurate, just close enough. I wonder how he ended up getting dates, perhaps he used Google's ngram stuff and it's just as unreliable?
posted by symbioid at 7:32 PM on October 30, 2014
And I recommend Python for scraping. Maybe you can ask on the opendata communities on reddit and Stack Exchange too :)
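Something like this could be a starting point for the "pull the data into a database" part, using only the standard library. It is just a sketch: the page-fetching step is left out on purpose (get permission and rate-limit first), the date notations it looks for ("1847", "1590s", "late 14c.") are an assumption about how entries are written, and `first_year`, `build_db`, and the sample entries are invented for the example.

```python
import re
import sqlite3

# Assumed date notations: a plain year ("1847"), a decade ("1590s"),
# or a century ("14c."). Anything else is treated as "no date found".
DATE_RE = re.compile(r"\b(\d{1,2})c\.|\b(\d{3,4})s?\b")


def first_year(entry_text):
    """Return the first year-like value mentioned in an entry, or None."""
    match = DATE_RE.search(entry_text)
    if match is None:
        return None
    if match.group(1):
        # Century form: "14c." is mapped to the start of that century, 1300.
        return (int(match.group(1)) - 1) * 100
    # Plain year or decade: "1590s" is mapped to 1590.
    return int(match.group(2))


def build_db(entries):
    """Store {word: entry_text} as (word, year) rows in an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, year INTEGER)")
    rows = [(word, first_year(text)) for word, text in entries.items()]
    conn.executemany("INSERT INTO words VALUES (?, ?)", rows)
    conn.commit()
    return conn


# Invented entries, not real etymologies.
SAMPLE_ENTRIES = {
    "blorp": "1590s, from an invented root.",
    "snarf": "late 14c., invented sample text.",
    "quux": "origin unknown.",
}

if __name__ == "__main__":
    db = build_db(SAMPLE_ENTRIES)
    for row in db.execute("SELECT word, year FROM words ORDER BY word"):
        print(row)
```

Taking the first date in the entry is a simplification; it only gives the earliest attestation if entries lead with it. Swap `":memory:"` for a file path to keep the results between runs.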
Regards and good luck.
posted by bussiere at 2:39 PM on October 30, 2014 [1 favorite]