Word I find software that can get to the root of this problem?
May 13, 2010 1:06 PM

How do I obtain a list of all words associated with a list of word roots / stems?

(Please excuse my possibly incorrect linguistics/computer science terminology.)

Essentially, I want to automate the process of finding all existing words with the same root. If the root were "alarm", I would want the output to be alarm, alarms, alarming, alarmed, etc.

I have been given a large list of words in an Excel document. Each column represents a category of words and word roots. Entries with an asterisk at the end (e.g. alarm*) represent a root and need to be "reverse stemmed", for want of a more accurate term. For each column, the output should then be added to an already existing text file of alphabetically sorted words belonging to that same category. One more catch: the existing text file has a column containing the number "2" to the right of each word. I would like the newly inserted words to have the number "5" in that same column.

I'm hoping that someone knows of a way to get this done without resorting to writing my own program. I am not averse to the Linux command line, just very rusty. Even software that allows me to manually input a root and receive a list of possible words would save a substantial amount of time, however.
posted by hooves to Computers & Internet (6 answers total) 3 users marked this as a favorite
Interestingly enough, the business I work for does validation of user data online. One of the things we do is see if fraudsters try to use words as names.

A colleague showed me a word list with derived words a few months ago that he was using for this detection. When he gets back from lunch I'll see if I can get it from him.
posted by TimeTravelSpeed at 1:48 PM on May 13, 2010

I think the omegawiki database might have this type of information. You'll have to write a script to get it out of the db, though. The database structure is (was?) undocumented and tricky to figure out.
posted by rainy at 1:57 PM on May 13, 2010

You can also try this. Not sure if these are the ones my colleague has, but might be in there somewhere.
posted by TimeTravelSpeed at 2:03 PM on May 13, 2010

The standard Linux cli way to do this:
grep alarm /usr/share/dict/words

change alarm to '^alarm' if you only want words starting with that stem (though for that particular example you get the same output either way).

change it to 'alarm$' to get only words ending in alarm (only the word "alarm" itself in this case).

on a debian based system (this includes ubuntu) for american spelling I recommend the package wamerican-huge - but you can use any package providing the 'wordlist' virtual package

sudo aptitude install wamerican-huge
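
For the rest of the job (expanding every asterisked entry and merging the results into the existing category file), here is a rough Python sketch rather than a finished tool. It makes several assumptions that are not in the question: each Excel column has been exported to a plain list with one entry per line, the existing file separates the word and the number with whitespace, and the output uses a tab. The names expand_root and merge_category are made up for illustration. It does the same prefix match as grep '^alarm', so it will also pick up unrelated words that merely start with the same letters, and the results still need a manual check.

```python
import re


def expand_root(root, dictionary_words):
    """Return the dictionary words that start with root (like grep '^root')."""
    pattern = re.compile("^" + re.escape(root))
    return sorted({w for w in dictionary_words if pattern.match(w)})


def merge_category(existing_lines, entries, dictionary_words, new_flag="5"):
    """Merge entries into existing 'word number' lines.

    Words already in the file keep their number; new words get new_flag.
    Entries ending in '*' are treated as roots and expanded first.
    """
    flags = {}
    for line in existing_lines:
        parts = line.split()
        if len(parts) >= 2:
            flags[parts[0]] = parts[1]

    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        if entry.endswith("*"):
            candidates = expand_root(entry[:-1], dictionary_words)
        else:
            candidates = [entry]
        for word in candidates:
            # setdefault leaves words that are already present untouched
            flags.setdefault(word, new_flag)

    # Assumption: plain code-point ordering is close enough to the
    # file's existing alphabetical order (capitals sort before lowercase).
    return [f"{word}\t{flag}" for word, flag in sorted(flags.items())]


# Small in-memory demo; real input would come from files instead.
words = ["alarm", "alarmed", "alarming", "alarms", "alert", "calm"]
existing = ["alarm\t2", "alert\t2"]
merged = merge_category(existing, ["alarm*", "calm"], words)
print("\n".join(merged))
```

In real use the words list would be the stripped lines of /usr/share/dict/words, and existing would be the lines of the category text file.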
posted by idiopath at 2:19 PM on May 13, 2010 [1 favorite]

There's a great web site I use all the time at wordwaldo.com that does just that. You put 'alarm' into either "starts with", "contains", or "ends with" and it spits out all matches.
posted by yehaskel at 10:19 PM on May 13, 2010

Thank you to everyone who offered advice. This will get me started.
posted by hooves at 1:34 PM on May 25, 2010
