script dictionary
January 14, 2009 5:40 PM

I need a script that will extract from a dictionary all words that contain a certain set of letters, e.g. "a", "b", "e" should return "abe", "babe", "a", etc. I know this is trivial in Perl; however, I want the biggest dictionary I can get my hands on, not just the default one on Linux. So my question has two parts: can you please point me to such a script, and also to the biggest free dictionary in one or all of these scripts: Roman, Cyrillic, Greek? This has possibly been implemented as a website. Which website?
posted by pita to Health & Fitness (24 answers total) 2 users marked this as a favorite
 
I think you can do this with OneLook, which uses a pretty large range of dictionaries.
posted by lucidium at 5:47 PM on January 14, 2009


This is just a simple regular expression. If your dictionary has a word on each line,
^[abe]+$ is the regex you want. You can use grep, instead of perl.

grep -E "^[abe]+$" dictionary-file-1 dictionary-file-2 dictionary-file-3 ...
posted by aubilenon at 5:53 PM on January 14, 2009
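
For anyone who would rather stay in Python, the same pattern can be tried with the standard `re` module. This is only a minimal sketch: the name `grep_lines` and the sample lines are made up for illustration, and a real run would read the lines from a one-word-per-line dictionary file instead.

```python
import re

# Same pattern as the grep command: one or more of a, b, e and nothing else.
ONLY_ABE = re.compile(r"^[abe]+$")


def grep_lines(lines):
    """Return the lines (newline stripped) that match the pattern."""
    stripped = (line.rstrip("\n") for line in lines)
    return [word for word in stripped if ONLY_ABE.match(word)]


# Hypothetical sample input standing in for a dictionary file.
print(grep_lines(["abe\n", "babe\n", "a\n", "bead\n", "\n"]))
```

The `+` means at least one letter is required, so blank lines are dropped, and "bead" is rejected because of its "d".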


ok: how?
posted by pita at 5:53 PM on January 14, 2009


to clarify: the how? question was for lucidium's answer: how do i query onelook to get what i want?
posted by pita at 5:55 PM on January 14, 2009


Previously.
posted by Johnny Assay at 6:17 PM on January 14, 2009


I'm not sure how many words count as "huge" but this includes ~168K words.
posted by If only I had a penguin... at 6:23 PM on January 14, 2009


by the way, on Ubuntu 8.04, at least the following word lists are available for installation:
wamerican - American English dictionary words for /usr/share/dict
wbrazilian - Brazilian Portuguese wordlist
wbritish - British English dictionary words for /usr/share/dict
wbulgarian - The Bulgarian dictionary words for /usr/share/dict
wcatalan - Catalan dictionary words for /usr/share/dict
wfrench - French dictionary words for /usr/share/dict
wirish - Irish (Gaeilge) dictionary words for /usr/share/dict
witalian - The Italian dictionary words for /usr/share/dict/
wmanx - Manx Gaelic dictionary words for /usr/share/dict
wogerman - The old German dictionary for /usr/share/dict
wpolish - Polish dictionary words for /usr/share/dict
wportuguese - European Portuguese wordlist
wspanish - The Spanish dictionary words for /usr/share/dict
wukrainian - Ukrainian dictionary words for /usr/share/dict
wamerican-huge - American English dictionary words for /usr/share/dict
wamerican-large - American English dictionary words for /usr/share/dict
wamerican-small - American English dictionary words for /usr/share/dict
wbritish-huge - British English dictionary words for /usr/share/dict
wbritish-large - British English dictionary words for /usr/share/dict
wbritish-small - British English dictionary words for /usr/share/dict
wcanadian - Canadian English dictionary words for /usr/share/dict
wcanadian-huge - Canadian English dictionary words for /usr/share/dict
wcanadian-large - Canadian English dictionary words for /usr/share/dict
wcanadian-small - Canadian English dictionary words for /usr/share/dict
wfinnish - A small Finnish dictionary for /usr/share/dict
wgaelic - A Scots Gaelic word list

posted by jepler at 6:41 PM on January 14, 2009
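
If several of these word lists are installed, one way to get a bigger dictionary is to merge them into a single set. The sketch below assumes one word per line and UTF-8 text; the function name `load_words` and the two file paths are illustrative assumptions, since the actual file names under /usr/share/dict depend on which packages are installed and how they are encoded.

```python
from pathlib import Path


def load_words(paths, encoding="utf-8"):
    """Merge several one-word-per-line files into a single set of words."""
    words = set()
    for path in map(Path, paths):
        if not path.is_file():
            continue  # skip word lists that are not installed
        with path.open(encoding=encoding, errors="replace") as handle:
            words.update(line.strip() for line in handle if line.strip())
    return words


# Assumed file names, for illustration only.
words = load_words(["/usr/share/dict/american-english-huge",
                    "/usr/share/dict/british-english-huge"])
print(len(words))
```

Using a set removes the duplicates that overlapping lists (American and British English, say) would otherwise produce.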


Ah sorry, it looks like you can't search for any order of the letters, only "*a*b*" in order. I thought I remembered there being more powerful wildcards.

"*" can count as zero characters too, though. So I suppose you could search for "*a*b*" and "*b*a*" to get the full list, if you've only got a few small sets you want to search for.
posted by lucidium at 6:52 PM on January 14, 2009
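
To see why this only works for small sets, here is a rough sketch that generates one wildcard query per ordering of the letters. The `*a*b*` syntax is taken from the post above and has not been checked against OneLook; the function name `wildcard_queries` is made up for this example.

```python
from itertools import permutations


def wildcard_queries(letters):
    """Return one '*x*y*...' query per ordering of the distinct letters."""
    distinct = sorted(set(letters))
    return ["*" + "*".join(order) + "*" for order in permutations(distinct)]


print(wildcard_queries("ab"))        # ['*a*b*', '*b*a*']
print(len(wildcard_queries("abe")))  # 6
```

The number of queries grows factorially with the number of distinct letters, so a 14-letter set is well out of reach this way.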


Just to check, when given the string "abe" you want to find all words in the dictionary which contain "a" or "b" or "e"?

I just did a quick test on a dictionary of 32,000 English words and found that 20,000 of them contain the letter "e". Are you sure this is going to be useful?
posted by AmbroseChapel at 7:14 PM on January 14, 2009


Just to check, when given the string "abe" you want to find all words in the dictionary which contain "a" or "b" or "e"?

I believe he wants all words containing "a", "b", or "e", but no other letters. This would limit your selection somewhat.
posted by Johnny Assay at 7:17 PM on January 14, 2009


no other letters? i can do that in my head, but something tells me that it's more than this.
posted by rhizome at 8:04 PM on January 14, 2009


Johnny has it right. To clarify: I gave "a", "b" and "e" as an example. In theory I would like to search for words made from *any* number of individual letters.

For example: give me words that contain: a c e f i l m n r s u v w z *but no other letters*

So: 'facile' 'film' 'suave' 'melee' are correct answers, but 'flippant' is not, since 'flippant' contains p and t - not in my list.

Please note that this is again just an example: this time I listed 14 letters, but I want a general-purpose script that I can tweak for 14 letters, or 20, or 8.
posted by pita at 8:25 PM on January 14, 2009
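
The rule described above (every letter of the word must come from the list) amounts to a subset test. A minimal sketch, using the example words from the post; the function name `uses_only` is an assumption made for illustration:

```python
def uses_only(word, allowed):
    """True if the word is non-empty and every letter is in `allowed`."""
    return bool(word) and set(word) <= set(allowed)


allowed = "acefilmnrsuvwz"
for word in ["facile", "film", "suave", "melee", "flippant"]:
    print(word, uses_only(word, allowed))
```

The first four words print True and 'flippant' prints False because of its p and t. Changing `allowed` is all it takes to switch to 8 or 20 letters.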


python:
valid_letters = set("abcd")

with open("dictionary.txt") as dictionary:
    for line in dictionary:
        word = line.strip()
        # Keep non-empty words made up only of the valid letters.
        if word and all(letter in valid_letters for letter in word):
            print(word)

posted by signal at 8:33 PM on January 14, 2009


^ this assumes "dictionary.txt" has one word per line.
posted by signal at 8:33 PM on January 14, 2009


Using grep makes much more sense than that Python script. You don't need a script. It's a single command line.
posted by grouse at 8:52 PM on January 14, 2009


So he would grep for "^a|c|e|f|i|l|m|n|r|s|u|v|w|z$"?
posted by AmbroseChapel at 9:04 PM on January 14, 2009


No, hold on, that's wrong, "^[acefilmnrsuvwz]+$" is what you want, at least as a perl regular expression.
posted by AmbroseChapel at 9:05 PM on January 14, 2009 [1 favorite]


Here's a perl script:

open( D, '<', 'path/to/dictionary' ) or die "can't open dictionary";
while (<D>) {
  print if /^[$ARGV[0]]+$/;
}


which you would call like

wordfinder.pl acefilmnrsuvwz
posted by AmbroseChapel at 9:22 PM on January 14, 2009


Damn I hate posting code here!

open( D, '<', 'path/to/dictionary' ) or die "can't open dictionary";
while (<D>) {
  print if /^[$ARGV[0]]+$/;
}

posted by AmbroseChapel at 9:23 PM on January 14, 2009
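
One thing to watch when the letters come from the command line: characters such as "-", "^" or "]" have a special meaning inside a character class, so a letter list like "a-z" would quietly match the whole alphabet. Below is a hedged Python sketch of the same idea that escapes the input first. The names `matching_lines` and `main`, and the choice to pass the dictionary path as a second argument, are assumptions of this example and not part of the Perl script.

```python
import re
import sys


def matching_lines(letters, lines):
    """Yield stripped lines made up only of characters in `letters`."""
    # re.escape keeps '-', '^' and ']' literal inside the character class.
    pattern = re.compile("[" + re.escape(letters) + "]+")
    for line in lines:
        word = line.strip()
        if pattern.fullmatch(word):
            yield word


def main(argv):
    """Usage: wordfinder.py LETTERS DICTIONARY_FILE"""
    if len(argv) != 2:
        print(main.__doc__, file=sys.stderr)
        return 2
    letters, path = argv
    with open(path, encoding="utf-8") as handle:
        for word in matching_lines(letters, handle):
            print(word)
    return 0


# To run as a script, finish the file with: sys.exit(main(sys.argv[1:]))
```

Saved with that last line enabled, it would be called like `python wordfinder.py acefilmnrsuvwz path/to/dictionary`.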


wineverygame.com
posted by dirtdirt at 11:17 PM on January 14, 2009 [1 favorite]


On a Unix system:

% egrep -v "[^abe]" /usr/share/dict/words
a
aa
aba
abb
ae
b
ba
baa
baba
babe
bae
be
bee
e
ea
ebb

… etc.

The -v flag to egrep inverts a match, showing only lines that do not match the given pattern. The pattern [^abe] matches any line that contains any character other than “a,” “b,” or “e.” So the command above produces all lines in /usr/share/dict/words that only contain those letters.
posted by ijoshua at 7:26 AM on February 12, 2009


I should add that my command above does a case-sensitive match. If you want case-insensitivity, change the flag to -iv.
posted by ijoshua at 7:43 AM on February 12, 2009


That seems like an odd, backwards way to do it. It also takes almost twice as long as the more straightforward way:

$ time egrep "^[abe]+$" /usr/share/dict/words > /dev/null

real 0m0.031s
user 0m0.029s
sys 0m0.002s

$ time egrep -v "[^abe]" /usr/share/dict/words > /dev/null

real 0m0.056s
user 0m0.055s
sys 0m0.002s

posted by grouse at 7:58 AM on February 12, 2009
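
The two approaches can also be compared in Python. This is a rough sketch on a synthetic word list, not a reproduction of the egrep timings above: the function names are made up, and the numbers it prints will vary by machine and say nothing about grep itself. One real difference worth knowing is that the inverted form lets empty lines through, while the `+` in the positive form rejects them.

```python
import re
import timeit

POSITIVE = re.compile(r"[abe]+")  # whole line must match (like ^[abe]+$)
NEGATED = re.compile(r"[^abe]")   # any hit disqualifies the line (like -v)


def by_full_match(words):
    """Keep words that consist entirely of a, b, e."""
    return [w for w in words if POSITIVE.fullmatch(w)]


def by_inverted_match(words):
    """Keep words that contain no character other than a, b, e."""
    return [w for w in words if not NEGATED.search(w)]


# Synthetic stand-in for a word list (no empty entries).
words = ["abe", "babe", "bead", "flippant", "ebb", "zebra"] * 10000

# On non-empty words both approaches select the same entries.
assert by_full_match(words) == by_inverted_match(words)

for func in (by_full_match, by_inverted_match):
    seconds = timeit.timeit(lambda: func(words), number=5)
    print(f"{func.__name__}: {seconds:.3f}s")
```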


You are correct, grouse. Thanks for pointing that out; I didn’t think of using the pattern to match the entire line.
posted by ijoshua at 11:04 AM on February 12, 2009

