Wanted: American Heritage Dictionary headwords as spell check dictionary
December 5, 2019 9:32 AM

Our house dictionary is the American Heritage Dictionary. Our main applications use Hunspell as their spell checker, whose built-in word list is not the AHD, and inevitably disagrees with it in some places. The natural solution would be to buy/rent/license the AHD word list, so that we can use it instead. How would I pursue this? I see nothing on the AHD/Houghton Mifflin Harcourt website. Or if it's impossible — maybe because of piracy concerns? — then how do other companies with a house dictionary handle this?

To get some objections and digressions out of the way:
  1. I'm not mad that dictionaries disagree on spelling. I know it's normal. I'm not mad at Hunspell either. I just want to give it a different word list.
  2. I can use a word list in any format, from "plain text list" on up. I know, on a technical level, how to turn those into a Hunspell dictionary file. I just need to get my hands on the words.
  3. I know spell check can't replace human editors. I am a human editor. This would help my writers.
  4. I couldn't change our house dictionary if I wanted to. You can think it's a bad choice all you want, but that doesn't help me.
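
For reference, the conversion mentioned in point 2 can be sketched roughly as follows. This is a minimal sketch, assuming a bare word list with no affix flags: a Hunspell .dic file begins with an approximate entry count followed by one word per line, and the smallest workable .aff file just declares the encoding. The function name `build_hunspell_files` is made up for illustration.

```python
def build_hunspell_files(words):
    """Return (dic_text, aff_text) for a bare word list with no affix flags."""
    # Deduplicate and drop blank lines; sorting only keeps the output stable.
    entries = sorted({w.strip() for w in words if w.strip()})
    # A .dic file starts with an approximate entry count, then one word per line.
    dic_text = "\n".join([str(len(entries))] + entries) + "\n"
    # Assumption: no affix rules are wanted, so the .aff only sets the encoding.
    aff_text = "SET UTF-8\n"
    return dic_text, aff_text
```

The two strings would be written to matching files (say `house.dic` and `house.aff`, both hypothetical names). One caveat: a slash in a .dic entry introduces affix flags, so any headword containing "/" would need special handling first.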
posted by nebulawindphone to Writing & Language (7 answers total) 3 users marked this as a favorite
You might be able to reach out to someone here about licensing.

You can find what purports to be the 4th edition of the AHD in dict format. If it's for in-house use only, that might suffice. Obviously, if you're shipping software that uses it, I'd contact HM about licensing.

Silly question - how many words are we talking? Is it not possible to curate a list and just modify Hunspell's list in place? We don't have anything as fancy as a separate app with a spell checker (we use Google Docs for drafts) but we have an in-house style guide with info on any unique / specific spelling that's been curated by our "word nerds" for years.
posted by jzb at 10:57 AM on December 5, 2019 [2 favorites]

Well, to find out exactly how many differences there were I'd need the AHD's word list. :)

But my expectation is that it is not just a handful of words, but potentially hundreds or more — mostly words with variant spellings where different dictionaries have chosen different variants. The vast majority of these will be words that aren't in our in-house style guide, because we don't use them often enough to have made them worth making a note of.
posted by nebulawindphone at 11:18 AM on December 5, 2019

Could you programmatically run Hunspell over your existing corpus of edited articles, aggregate the words it flags as exceptions, then edit that list and use it as your base?
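
A rough sketch of that flag-and-aggregate step, in Python. To keep it self-contained, a plain set of known words stands in for the spell checker here (an assumption for illustration). In practice you would more likely pipe the text through `hunspell -l`, which lists the words it considers misspelled, and count that output instead. The name `flagged_words` is hypothetical.

```python
import re
from collections import Counter

# Runs of letters (accented ones included), allowing internal apostrophes or hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def flagged_words(documents, known_words):
    """Count, across all documents, the tokens that are absent from known_words."""
    # Stand-in for the spell checker: a case-insensitive set lookup.
    known = {w.lower() for w in known_words}
    counts = Counter()
    for text in documents:
        for token in WORD_RE.findall(text):
            if token.lower() not in known:
                counts[token.lower()] += 1
    return counts
```

Sorting the result with `counts.most_common()` puts the most frequent disagreements at the top, which is where a human editor would presumably start reviewing.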
posted by books for weapons at 12:10 PM on December 5, 2019

That would be a fun idea to toss around.

But I'm asking a real-world question: Can one license a dictionary in this way? And if not, what do actual companies who face this problem do, assuming they don't want to break the law?

If you know someone who has done the flag-and-aggregate-and-edit thing to meet the spell checking needs of a company with a few million words of documents, I'd love to talk to them.
posted by nebulawindphone at 12:48 PM on December 5, 2019

The copyright-eligibility of a bare list of headwords under US law is, at the very least, questionable under Feist v. Rural. That might be one reason why Houghton Mifflin is not in a rush to license it. (I'm assuming you're in the US since AFAIK the AHD doesn't have much uptake elsewhere.)

I can attest that it is possible to extract the list of ~114k headwords from the Stardict file of the AHD4 that jzb linked above, using this simple little Python 2.7 script, although you may need to change the mode in line 109 from "r" to "rb".
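
For anyone on Python 3, the extraction can also be sketched in a few lines. This is not the linked script, just a minimal sketch under stated assumptions: the .idx file is uncompressed and uses the common layout of a NUL-terminated UTF-8 headword followed by a 4-byte offset and a 4-byte size (files that declare 64-bit offsets would need 12 bytes skipped instead of 8). The function name is made up.

```python
def stardict_headwords(idx_bytes):
    """Yield headwords from the raw bytes of an uncompressed StarDict .idx file."""
    pos = 0
    while pos < len(idx_bytes):
        # Each headword is a UTF-8 string terminated by a NUL byte.
        end = idx_bytes.index(b"\x00", pos)
        yield idx_bytes[pos:end].decode("utf-8")
        # Skip the NUL, then the 4-byte data offset and 4-byte data size.
        pos = end + 1 + 8
```

The bytes would come from opening the .idx file in binary mode, e.g. `open(path, "rb").read()`, which is presumably also why the "r" to "rb" change is needed in the older script.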

It might be prudent to run this idea past Legal before putting it into practice, however.
posted by Not A Thing at 1:09 PM on December 5, 2019

When I was with Collins Dictionaries, we used to do this all the time. It was a very long time ago, so I can't help with prices. Here are their current language resources offerings.

The AHD4 electronic licensing page says the data comes as XML. Depending on the structure, this could be parsed relatively easily to create a headword list. Note that HMH don't provide just the bare word list: as Not A Thing noted, that would be hard to copyright in the US unless it had trap-words embedded in it. Collins once successfully used, as evidence of copying, a series of bizarre typos that had crept into one of their Spanish dictionaries and then miraculously appeared in the framework of another publisher's Catalan dictionary. (In a rare piece of corporate good news, rather than a ruinous lawsuit it became a valuable partnership that produced better English/Catalan and English/Spanish dictionaries.)
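
As an illustration of the XML parsing step, here is a minimal Python sketch. The real schema of the licensed data is unknown to me, so the `hw` element name is purely hypothetical, as is the function name. The actual tag would have to be read off the delivered files.

```python
import xml.etree.ElementTree as ET


def headwords_from_xml(xml_text, tag="hw"):
    """Collect the text of every <tag> element, in document order, without duplicates."""
    root = ET.fromstring(xml_text)
    seen = set()
    result = []
    # Assumption: each headword lives in its own element named `tag`.
    for element in root.iter(tag):
        # itertext() also picks up text nested inside inline markup.
        word = "".join(element.itertext()).strip()
        if word and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```

For a file too large to load at once, `ET.iterparse` would be the usual alternative, but the idea is the same.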

A giant word list isn't always best for a spelling checker, because it will include rare-but-valid words (such as compotation and wether) that are far more likely to be misspellings of common words (computation, whether). I don't know if you can feed word frequencies along with headwords into Hunspell, but that would catch the rare-but-valids.
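
Failing a frequency input in the checker itself, one workaround is to filter the list before building the dictionary. A minimal Python sketch, assuming you have some frequency table to hand (for example, counts from your own documents); the function name, the `frequencies` mapping and the `min_count` threshold are all hypothetical.

```python
def split_by_frequency(headwords, frequencies, min_count=1):
    """Split headwords into (keep, review) using a word -> count mapping."""
    keep, review = [], []
    for word in headwords:
        # Words never (or rarely) seen in the reference counts go to manual review.
        if frequencies.get(word.lower(), 0) >= min_count:
            keep.append(word)
        else:
            review.append(word)
    return keep, review
```

The `review` list is where rare-but-valid words like "wether" would land, so an editor can decide whether the spell checker should accept them silently.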

After the L&H fiasco the dictionary data market went a bit quiet as people were wary of companies pursuing Underpants Gnomes-style "Buy all the word lists, ???, Profit!" dotcom-boom shilling. Contact HMH through the page that jzb posted and ask them. They probably have consultants you can work with to get exactly what you want.

The "analysis of your corporate documents as a corpus" idea is a good one, but I don't know if it's available from consultants except to the very largest legal/government clients.
posted by scruss at 1:35 PM on December 5, 2019

Software company here: totally normal to license dictionary data, usually in XML, might be an annual charge.
posted by alasdair at 1:49 PM on December 5, 2019
