New spelling, new root?
April 13, 2012 4:42 PM

I'd like to check if changed spellings are changed words.

For a commercial project I have in mind, I want to be able to compare two words and see whether the spellings are versions of the same word or not. For example, age can become ages and aged and it is the same root, but garbage is a new word. Conventional dictionaries show this by listing prefixes and suffixes for words, but the online dictionaries I've found are all just exhaustive lists of spellings. Are there dictionaries that exist electronically, structured as root plus legal suffixes and prefixes? As noted, this is for commercial use, but we may be able to pay licensing.
posted by meinvt to Writing & Language (6 answers total) 2 users marked this as a favorite
Yes - the word you're looking for is "lemma" - age/aged/ages all lemmatize to age. might be/have what you want.
posted by spaceman_spiff at 4:52 PM on April 13, 2012 [3 favorites]

Collins dictionaries (for whom I used to work more than a decade ago) used to sell lemma frameworks for several languages.
posted by scruss at 4:55 PM on April 13, 2012 [2 favorites]

The Natural Language Toolkit for Python uses Princeton's WordNet to accomplish this relatively easily.
posted by ob1quixote at 5:34 PM on April 13, 2012 [2 favorites]

A related notion is stemming, a process by which words are converted to their stems, and if the stems of two (literally) different words match, then they are likely different forms of the same word. The standard stemming algorithm is the Porter Stemming Algorithm (for English). Stemming algorithms are far from perfect (due to the irregularities of language) but they can get you pretty good results with very little effort.
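
To make the idea concrete, here is a minimal sketch of comparing two words by their stems. The `crude_stem` function is a hypothetical toy that strips a handful of suffixes; it is not the Porter algorithm, which has many more rules, and the names here are made up for illustration.

```python
def crude_stem(word: str) -> str:
    """Strip a few common English suffixes.

    A toy stand-in for illustration only, NOT the Porter algorithm.
    """
    w = word.lower()
    # Plural / third-person "s", leaving "ss" endings such as "glass" alone.
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    # Past-tense and progressive endings, only if at least two letters remain.
    for suffix in ("ing", "ed"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            w = w[: -len(suffix)]
            break
    # Trailing "e", so that "age" and "aged" meet at the same stem.
    if w.endswith("e") and len(w) > 2:
        w = w[:-1]
    return w


def same_stem(a: str, b: str) -> bool:
    """Two spellings are treated as forms of one word if their stems match."""
    return crude_stem(a) == crude_stem(b)


print(same_stem("age", "ages"))     # age / ages
print(same_stem("age", "aged"))     # age / aged
print(same_stem("age", "garbage"))  # age / garbage
```

With these rules age, ages and aged all reduce to the same stem, while garbage does not. The same rules also wrongly match used with us, which is the kind of imperfection mentioned above.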
posted by axiom at 6:31 PM on April 13, 2012 [1 favorite]

Great info so far. Looks like I'll need to research which of these are unencumbered enough to be used commercially, which work in an offline environment, and what their relative storage/processor footprints are. Thanks for the advice so far, further suggestions welcome.
posted by meinvt at 8:16 PM on April 13, 2012

All of the approaches mentioned are quite fast.

If you've got especially tight constraints on storage, you should use the Porter stemmer (or, better yet, Porter2, which is a bit of an improvement). It doesn't rely on any sort of stored dictionary. The downside is that, as axiom says, it's not perfectly accurate.

Something like WordNet will take up more space, but it will be more accurate. The downside is that it depends on a stored dictionary to match words to lemmas. Rare words that aren't in the dictionary will trip it up.

Maybe the best solution is to use Porter plus a customized list of exceptions: words that will come up reasonably often in your application for which Porter gives the wrong result. Porter2 already includes a few common exceptions: it "knows," for instance, that herring isn't related to her, and early isn't related to ear. But you can generate your own list of exceptions to cover rarer cases. Take a list of words and lemmas, such as the WordNet dictionary, and run them all through the Porter stemmer. Filter out the ones for which Porter makes the correct prediction (the word and its lemma come out with the same stem), and keep the ones for which it makes the wrong prediction. This will give you a fairly short list of exceptions which you can add to the program.

That last approach will be the best of all worlds: broader coverage and much smaller storage footprint than a dictionary alone, better accuracy than Porter alone. Unless you have incredibly tight storage constraints, it's what I'd recommend.
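
A rough sketch of that exception-list idea, under some loud assumptions: `toy_stem` is a made-up stand-in for a real Porter implementation, `SAMPLE` is a tiny hand-written word-to-lemma table standing in for a real dictionary, and "correct prediction" is taken to mean that a word and its lemma get the same stem.

```python
from typing import Callable, Dict


def toy_stem(word: str) -> str:
    """Toy stand-in for a real stemmer (NOT Porter): strips s, ed, ing, final e."""
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    for suffix in ("ing", "ed"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            w = w[: -len(suffix)]
            break
    if w.endswith("e") and len(w) > 2:
        w = w[:-1]
    return w


def build_exceptions(lemma_of: Dict[str, str],
                     stem: Callable[[str], str]) -> Dict[str, str]:
    """Keep only the words whose stem differs from the stem of their lemma."""
    return {w: lemma for w, lemma in lemma_of.items() if stem(w) != stem(lemma)}


def root(word: str, exceptions: Dict[str, str],
         stem: Callable[[str], str]) -> str:
    """Look the word up in the exception list first, then fall back to stemming."""
    return stem(exceptions.get(word.lower(), word))


# Hypothetical sample data; a real run would use a full word/lemma list.
SAMPLE = {"ages": "age", "aged": "age", "mice": "mouse",
          "went": "go", "running": "run"}
EXCEPTIONS = build_exceptions(SAMPLE, toy_stem)

print(sorted(EXCEPTIONS))
print(root("mice", EXCEPTIONS, toy_stem) == root("mouse", EXCEPTIONS, toy_stem))
```

Only the words the stemmer gets wrong end up stored, so the table stays much smaller than the full dictionary. A real Porter stemmer would already handle running, so the list it produces would be different from this toy one. This sketch only catches words that fail to match their own lemma; catching unrelated words that share a stem would need a second pass.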
posted by nebulawindphone at 10:19 PM on April 13, 2012 [3 favorites]
