Wanted: American Heritage Dictionary headwords as spell check dictionary
December 5, 2019 9:32 AM
Our house dictionary is the American Heritage Dictionary. Our main applications use Hunspell as their spell checker, whose built-in word list is not the AHD, and inevitably disagrees with it in some places. The natural solution would be to buy/rent/license the AHD word list, so that we can use it instead. How would I pursue this? I see nothing on the AHD/Houghton Mifflin Harcourt website. Or if it's impossible — maybe because of piracy concerns? — then how do other companies with a house dictionary handle this?
To get some objections and digressions out of the way:
- I'm not mad that dictionaries disagree on spelling. I know it's normal. I'm not mad at Hunspell either. I just want to give it a different word list.
- I can use a word list in any format, from "plain text list" on up. I know, on a technical level, how to turn those into a Hunspell dictionary file. I just need to get my hands on the words.
- I know spell check can't replace human editors. I am a human editor. This would help my writers.
- I couldn't change our house dictionary if I wanted to. You can think it's a bad choice all you want, but that doesn't help me.
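For readers who have not done the conversion mentioned in the list above, here is a minimal sketch of turning a plain word list into a Hunspell dictionary pair. It relies on two details of the Hunspell file format: a .dic file begins with an approximate entry count followed by one word per line, and a literal slash in a word is written as "\/". The function name, the "house" file name, and the decision to write no affix rules at all are assumptions made for illustration, not anything the poster described.

```python
from pathlib import Path


def write_hunspell_dictionary(words, out_dir, name="house"):
    """Write name.dic and name.aff for a flat word list (no affix rules).

    The function and file names are hypothetical, chosen for this sketch.
    """
    # Deduplicate, drop blank entries, and sort for a stable, diffable file.
    unique = sorted({w.strip() for w in words if w.strip()})
    # In a .dic file "/" introduces flags, so a literal slash is escaped.
    escaped = [w.replace("/", "\\/") for w in unique]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # First line: approximate entry count. Then one word per line.
    dic_path = out / f"{name}.dic"
    dic_path.write_text(
        "\n".join([str(len(escaped)), *escaped]) + "\n", encoding="utf-8"
    )

    # Minimal .aff file: it declares the encoding and nothing else.
    aff_path = out / f"{name}.aff"
    aff_path.write_text("SET UTF-8\n", encoding="utf-8")

    return dic_path, aff_path
```

With no affix rules, every inflected form has to be listed explicitly, so a headword-only list would likely need plurals and verb forms added before it is useful.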
Response by poster: Well, to find out exactly how many differences there were I'd need the AHD's word list. :)
But my expectation is that it is not just a handful of words, but potentially hundreds or more — mostly words with variant spellings where different dictionaries have chosen different variants. The vast majority of these will be words that aren't in our in-house style guide, because we don't use them often enough to make them worth noting.
posted by nebulawindphone at 11:18 AM on December 5, 2019
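If a house word list were in hand, counting the differences could be sketched roughly as below. This is an illustration under stated assumptions: the function names are hypothetical, the sample entries are made up, and the comparison looks only at the stems stored in a Hunspell .dic file. Hunspell also accepts forms generated by its affix rules, so a raw stem comparison will overstate the disagreement.

```python
import re


def hunspell_stems(dic_text):
    """Return the set of stems listed in the text of a Hunspell .dic file."""
    lines = dic_text.splitlines()
    # The first line is the approximate entry count, not a word.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    stems = set()
    for line in lines:
        # Keep the part before the first unescaped "/" (flags follow it),
        # then turn an escaped "\/" back into a plain slash.
        word = re.split(r"(?<!\\)/", line, maxsplit=1)[0]
        word = word.replace("\\/", "/").strip()
        if word:
            stems.add(word)
    return stems


def compare_word_lists(hunspell_words, house_words):
    """Return (words only in the house list, words only in Hunspell's list)."""
    hunspell_set, house_set = set(hunspell_words), set(house_words)
    return sorted(house_set - hunspell_set), sorted(hunspell_set - house_set)


# Made-up sample data, for illustration only.
sample_dic = "3\nadviser/S\nadvisor\ncolor/MS\n"
only_house, only_hunspell = compare_word_lists(
    hunspell_stems(sample_dic), ["adviser", "colour"]
)
print(only_house, only_hunspell)
```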
Could you programmatically run Hunspell over your existing corpus of edited articles, aggregate the words it flags as exceptions, then edit that list and use it as your base?
posted by books for weapons at 12:10 PM on December 5, 2019
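The flag-and-aggregate idea above can be sketched as follows. This is an illustration, not a tested workflow: the checker is passed in as a plain function so the example runs on its own, and the stand-in used here is a small made-up set of known words. A real run would ask Hunspell instead, for example by piping text through its command-line tool with the -l option, which lists the words it does not accept. The tokenizing pattern is also an assumption and only handles unaccented Latin letters.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of ASCII letters, allowing internal apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def aggregate_flagged(documents, is_known):
    """Count every word across the documents that the checker rejects."""
    flagged = Counter()
    for text in documents:
        for word in WORD_RE.findall(text):
            if not is_known(word):
                flagged[word] += 1
    return flagged


# Stand-in checker for illustration only; a real run would query Hunspell.
known = {"the", "was", "hired", "by", "firm"}
docs = ["The adviser was hired.", "The adviser hired the firm."]
counts = aggregate_flagged(docs, lambda w: w.lower() in known)
for word, n in counts.most_common():
    print(word, n)
```

Sorting by frequency puts the words an editor most needs to rule on at the top of the list, which is what makes the manual editing pass tractable.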
Response by poster: That would be a fun idea to toss around.
But I'm asking a real-world question: Can one license a dictionary in this way? And if not, what do actual companies who face this problem do, assuming they don't want to break the law?
If you know someone who has done the flag-and-aggregate-and-edit thing to meet the spell checking needs of a company with a few million words of documents, I'd love to talk to them.
posted by nebulawindphone at 12:48 PM on December 5, 2019
The copyright-eligibility of a bare list of headwords under US law is, at the very least, questionable under Feist v. Rural. That might be one reason why Houghton Mifflin is not in a rush to license it. (I'm assuming you're in the US since AFAIK the AHD doesn't have much uptake elsewhere.)
I can attest that it is possible to extract the list of ~114k headwords from the Stardict file of the AHD4 that jzb linked above, using this simple little Python 2.7 script, although you may need to change the mode in line 109 from "r" to "rb".
It might be prudent to run this idea past Legal before putting it into practice, however.
posted by Not A Thing at 1:09 PM on December 5, 2019
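The script linked in the comment above is not reproduced here. As a separate illustration, this is a minimal Python 3 sketch of reading headwords from a StarDict index, assuming the .idx file is uncompressed and has been read in binary mode. In that format each entry is a UTF-8 headword terminated by a NUL byte, followed by a big-endian offset (32 bits, or 64 if the .ifo file says so) and a 32-bit size. The function name and the sample entries are hypothetical.

```python
import struct


def read_stardict_headwords(idx_bytes, offset_bits=32):
    """Return the headwords from the raw bytes of an uncompressed .idx file."""
    # After each NUL-terminated word come the offset and the 4-byte size.
    tail_len = (8 if offset_bits == 64 else 4) + 4
    headwords = []
    pos = 0
    while pos < len(idx_bytes):
        end = idx_bytes.index(b"\x00", pos)  # end of the current headword
        headwords.append(idx_bytes[pos:end].decode("utf-8"))
        pos = end + 1 + tail_len  # skip the NUL, the offset, and the size
    return headwords


# Made-up index entries, packed the way a 32-bit .idx file stores them.
sample = b"".join(
    word.encode("utf-8") + b"\x00" + struct.pack(">II", offset, size)
    for word, offset, size in [("aardvark", 0, 10), ("abacus", 10, 7)]
)
print(read_stardict_headwords(sample))
```

For a real file, the bytes would come from something like open(path, "rb").read(); reading in text mode is what tends to break on the binary offset fields.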
Best answer: When I was with Collins Dictionaries, we used to do this all the time. It was a very long time ago, so I can't help with prices. Here are their current language resources offerings.
The AHD4 electronic licensing page says the data comes as XML. This could (depending on the structure) be relatively easily parsed to create a headword list. Note that HMH don't provide just the bare word list; as Not A Thing noted, that would be hard to copyright in the US unless it had trap-words embedded in it. Collins successfully used, as evidence of copying, a series of bizarre typos that had crept into one of their Spanish dictionaries and that miraculously appeared in the framework of another publisher's Catalan dictionary. (In a rare piece of corporate good news, rather than a ruinous lawsuit it became a valuable partnership that produced better English/Catalan and English/Spanish dictionaries.)
Having a giant wordlist for a spelling checker is not always the best, as it will include rare-but-valid words (such as compotation and wether) that are far more often misspellings of common words (computation, whether or weather). I don't know if you can feed in word frequencies along with headwords into Hunspell, but that would catch the rare-but-valids.
After the L&H fiasco the dictionary data market went a bit quiet as people were wary of companies pursuing Underpants Gnomes-style "Buy all the word lists, ???, Profit!" dotcom-boom shilling. Contact HMH through the page that jzb posted and ask them. They probably have consultants you can work with to get exactly what you want.
The "analysis of your corporate documents as a corpus" idea is a good one, but I don't know if it's available from consultants except to the very largest legal/government clients.
posted by scruss at 1:35 PM on December 5, 2019
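The "parse the XML into a headword list" step above might look something like this sketch. The structure of the licensed AHD data is not known here, so the element names "entry" and "hw", the function name, and the sample document are all invented for illustration and would need to be replaced with whatever the real schema uses.

```python
import xml.etree.ElementTree as ET


def headwords_from_xml(xml_text, entry_tag="entry", headword_tag="hw"):
    """Collect unique headwords, in document order, from dictionary XML.

    The tag names are hypothetical; the real schema may differ entirely.
    """
    root = ET.fromstring(xml_text)
    seen = set()
    headwords = []
    for entry in root.iter(entry_tag):
        hw = entry.find(headword_tag)  # direct child of the entry element
        if hw is None:
            continue
        # itertext() flattens any nested markup inside the headword element.
        text = "".join(hw.itertext()).strip()
        if text and text not in seen:
            seen.add(text)
            headwords.append(text)
    return headwords


# Invented sample data; it does not reflect any real dictionary file.
sample = """<dictionary>
  <entry><hw>adviser</hw><def>placeholder</def></entry>
  <entry><hw>wether</hw><def>placeholder</def></entry>
  <entry><hw>adviser</hw><def>placeholder</def></entry>
</dictionary>"""
print(headwords_from_xml(sample))
```

Homographs usually appear as separate entries, which is why the sketch deduplicates; variant spellings stored in other elements would need their own handling.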
Software company here: totally normal to license dictionary data, usually in XML, might be an annual charge.
posted by alasdair at 1:49 PM on December 5, 2019
This thread is closed to new comments.
You can find what purports to be the 4th edition of the AHD in dict format; if it's for in-house use only, then that might suffice. Obviously, if you're shipping software that uses this, then I'd contact HM about licensing.
Silly question: how many words are we talking? Is it not possible to curate a list and just modify Hunspell's list in place? We don't have anything as fancy as a separate app with a spell checker (we use Google Docs for drafts), but we have an in-house style guide with info on any unique or specific spelling, which has been curated by our "word nerds" for years.
posted by jzb at 10:57 AM on December 5, 2019