script dictionary
January 14, 2009 5:40 PM
I need a script that will extract from a dictionary all words that contain a certain set of letters, e.g. "a", "b", "e" should return "abe", "babe", "a", etc. I know this is trivial in Perl; however, I want the biggest dictionary I can get my hands on, not just the default on Linux. So I guess my question has two parts: can you please point me to this script, and also point me to the biggest free dictionary in one or all of these scripts: Roman, Cyrillic, Greek? This has possibly been implemented as a website. Which website?
This is just a simple regular expression. If your dictionary has a word on each line,
^[abe]+$ is the regex you want. You can use grep, instead of perl.
grep -E "^[abe]+$" dictionary-file-1 dictionary-file-2 dictionary-file-3 ...
posted by aubilenon at 5:53 PM on January 14, 2009
Response by poster: to clarify: the how? question was for lucidium's answer: how do i query onelook to get what i want?
posted by pita at 5:55 PM on January 14, 2009
I'm not sure how many words counts as "huge" but this includes ~168K words.
posted by If only I had a penguin... at 6:23 PM on January 14, 2009
by the way, on Ubuntu 8.04, at least the following word lists are available for installation:
wamerican - American English dictionary words for /usr/share/dict
wbrazilian - Brazilian Portuguese wordlist
wbritish - British English dictionary words for /usr/share/dict
wbulgarian - The Bulgarian dictionary words for /usr/share/dict
wcatalan - Catalan dictionary words for /usr/share/dict
wfrench - French dictionary words for /usr/share/dict
wirish - Irish (Gaeilge) dictionary words for /usr/share/dict
witalian - The Italian dictionary words for /usr/share/dict/
wmanx - Manx Gaelic dictionary words for /usr/share/dict
wogerman - The old German dictionary for /usr/share/dict
wpolish - Polish dictionary words for /usr/share/dict
wportuguese - European Portuguese wordlist
wspanish - The Spanish dictionary words for /usr/share/dict
wukrainian - Ukrainian dictionary words for /usr/share/dict
wamerican-huge - American English dictionary words for /usr/share/dict
wamerican-large - American English dictionary words for /usr/share/dict
wamerican-small - American English dictionary words for /usr/share/dict
wbritish-huge - British English dictionary words for /usr/share/dict
wbritish-large - British English dictionary words for /usr/share/dict
wbritish-small - British English dictionary words for /usr/share/dict
wcanadian - Canadian English dictionary words for /usr/share/dict
wcanadian-huge - Canadian English dictionary words for /usr/share/dict
wcanadian-large - Canadian English dictionary words for /usr/share/dict
wcanadian-small - Canadian English dictionary words for /usr/share/dict
wfinnish - A small Finnish dictionary for /usr/share/dict
wgaelic - A Scots Gaelic word list
posted by jepler at 6:41 PM on January 14, 2009
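If the goal is the biggest list possible, several installed word lists can be merged into one de-duplicated list. This is only a rough sketch: the function name merge_word_lists is made up for illustration, and it assumes each file holds one word per line. The encoding parameter is there because older word lists may not be UTF-8, so adjust it to whatever your files actually use.

```python
def merge_word_lists(paths, encoding="utf-8"):
    """Return the sorted distinct words found across all the given files."""
    words = set()
    for path in paths:
        with open(path, encoding=encoding) as handle:
            for line in handle:
                word = line.strip()
                if word:  # ignore blank lines
                    words.add(word)
    return sorted(words)
```

You would pass it whichever files the packages above put under /usr/share/dict on your machine, then run any of the filters in this thread over the result.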
Ah sorry, it looks like you can't search for any order of the letters, only "*a*b*" in order. I thought I remembered there being more powerful wildcards.
"*" can count as zero characters too, though. So I suppose you could search for "*a*b*" and "*b*a*" to get the full list, if you've only got a few small sets you want to search for.
posted by lucidium at 6:52 PM on January 14, 2009
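For a small letter set, the one-query-per-ordering idea above can be generated rather than typed by hand. A minimal sketch, assuming the site treats "*" as a wildcard that may match zero characters, as described; wildcard_queries is a hypothetical helper name. Note that these queries find words containing all of the letters, which is not the same as the "only these letters" filter.

```python
from itertools import permutations

def wildcard_queries(letters):
    """Return one '*x*y*...' query string per ordering of the letters."""
    # sorted(set(...)) drops repeated letters and keeps the output order stable
    unique = sorted(set(letters))
    return ["*" + "*".join(order) + "*" for order in permutations(unique)]

print(wildcard_queries("ab"))  # ['*a*b*', '*b*a*']
```

The number of queries grows factorially (3 letters give 6 queries, 14 letters give 87,178,291,200), so this is only practical for a handful of letters.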
Just to check, when given the string "abe" you want to find all words in the dictionary which contain "a" or "b" or "e"?
I just did a quick test on a dictionary of 32,000 English words and found that 20,000 of them contain the letter "e". Are you sure this is going to be useful?
posted by AmbroseChapel at 7:14 PM on January 14, 2009
Just to check, when given the string "abe" you want to find all words in the dictionary which contain "a" or "b" or "e"?
I believe he wants all words containing "a", "b", or "e", but no other letters. This would limit your selection somewhat.
posted by Johnny Assay at 7:17 PM on January 14, 2009
no other letters? i can do that in my head, but something tells me that it's more than this.
posted by rhizome at 8:04 PM on January 14, 2009
Response by poster: Johnny has it right. To clarify: I gave "a", "b" and "e" as an example. In theory I would like to search through for words for *any* number of individual letters.
For example: give me words that contain: a c e f i l m n r s u v w z *but no other letters*
So: 'facile' 'film' 'suave' 'melee' are correct answers, but 'flippant' is not, since 'flippant' contains p and t - not in my list.
Please do note that this again is an example: this time I listed 14 letters - but I want a general purpose script that I can tweak- for 14 letters-or 20-or 8.
posted by pita at 8:25 PM on January 14, 2009
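The rule spelled out above (every letter of the word must come from the list, with repeats allowed) is a subset test. A minimal sketch using the poster's own example words; uses_only is a made-up name, and rejecting the empty string is an assumption about what is wanted.

```python
def uses_only(word, allowed):
    """True if word is non-empty and every character of it is in allowed."""
    return bool(word) and set(word) <= set(allowed)

allowed = "acefilmnrsuvwz"
for word in ["facile", "film", "suave", "melee", "flippant"]:
    print(word, uses_only(word, allowed))
# facile True
# film True
# suave True
# melee True
# flippant False
```

Because Python strings are Unicode, the same test should work for Cyrillic or Greek letters, provided the dictionary file is opened with the right encoding.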
python:
valid_letters = "abcd"
with open("dictionary.txt") as dictionary:
    for line in dictionary:
        word = line.strip()
        # keep the word only if every one of its letters is in valid_letters
        if word and all(letter in valid_letters for letter in word):
            print(word)
posted by signal at 8:33 PM on January 14, 2009
^ this assumes "dictionary.txt" has one word per line.
posted by signal at 8:33 PM on January 14, 2009
Using grep makes much more sense than that Python script. You don't need a script. It's a single command line.
posted by grouse at 8:52 PM on January 14, 2009
So he would grep for "^a|c|e|f|i|l|m|n|r|s|u|v|w|z$"?
posted by AmbroseChapel at 9:04 PM on January 14, 2009
No, hold on, that's wrong, "^[acefilmnrsuvwz]+$" is what you want, at least as a perl regular expression.
posted by AmbroseChapel at 9:05 PM on January 14, 2009 [1 favorite]
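To see why the alternation was wrong: "^a|c|...|z$" reads as "starts with a, OR contains c, OR ... OR ends with z", so nearly any word containing one of the letters gets through. A small sketch with Python's re module (used here only for illustration) comparing the two patterns from the comments above:

```python
import re

wrong = re.compile(r"^a|c|e|f|i|l|m|n|r|s|u|v|w|z$")
right = re.compile(r"^[acefilmnrsuvwz]+$")

for word in ["film", "flippant"]:
    print(word, bool(wrong.search(word)), bool(right.search(word)))
# film True True
# flippant True False
```

The alternation accepts "flippant" because it contains an "f"; the bracketed class with "+" and both anchors rejects it because of the "p" and "t".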
Here's a perl script:
open( D, '<>while (<D>) {
print if /^[$ARGV[0]]+$/;
}>
which you would call like
wordfinder.pl acefilmnrsuvwz
posted by AmbroseChapel at 9:22 PM on January 14, 2009
Damn I hate posting code here!
open( D, '<', 'path/to/dictionary' ) or die "can't open dictionary";
while (<D>) {
print if /^[$ARGV[0]]+$/;
}
posted by AmbroseChapel at 9:23 PM on January 14, 2009
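One thing to watch with a "letters from the command line" script: the letters are dropped straight into a bracketed class, so an argument such as a-e would be read as a range rather than three characters. Here is a hedged Python sketch of the same idea that escapes the letters first. The names letters_pattern and find_words are made up for illustration, and the sample list stands in for a real dictionary file.

```python
import re

def letters_pattern(letters):
    """Compile a pattern for strings built only from the given characters."""
    if not letters:
        raise ValueError("need at least one letter")
    # re.escape keeps characters like '-', '^' and ']' literal inside the class
    return re.compile("[" + re.escape(letters) + "]+")

def find_words(letters, lines):
    """Yield each stripped line that consists solely of the given characters."""
    pattern = letters_pattern(letters)
    for line in lines:
        word = line.strip()
        if pattern.fullmatch(word):
            yield word

sample = ["abe\n", "babe\n", "bead\n", "a-e\n"]
print(list(find_words("abe", sample)))  # ['abe', 'babe']
print(list(find_words("a-e", sample)))  # ['a-e']
```

To run it over a real word list, pass an open file object as the second argument, since iterating a file yields one line at a time.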
On a Unix system:
% egrep -v "[^abe]" /usr/share/dict/words
a
aa
aba
abb
ae
b
ba
baa
baba
babe
bae
be
bee
e
ea
ebb
… etc.
The -v flag to egrep inverts a match, showing only lines that do not match the given pattern. The pattern [^abe] matches any line that contains any letter other than “a,” “b,” or “e.” So the command above produces all lines in /usr/share/dict/words that only contain those letters.
posted by ijoshua at 7:26 AM on February 12, 2009
I should add that my command above does a case-sensitive match. If you want case-insensitivity, make that flag -iv
posted by ijoshua at 7:43 AM on February 12, 2009
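The inverted filter and the whole-line pattern agree on every non-empty line, but they differ on blank lines: an empty line contains no "outside" character, so the -v version keeps it, while "^[abe]+$" needs at least one letter. A small Python sketch of both tests, with an optional flag standing in for -i; the function names are hypothetical.

```python
import re

def full_match(word, letters="abe", flags=0):
    # counterpart of: egrep "^[abe]+$"
    return re.fullmatch("[" + re.escape(letters) + "]+", word, flags) is not None

def no_outsiders(word, letters="abe", flags=0):
    # counterpart of: egrep -v "[^abe]"
    return re.search("[^" + re.escape(letters) + "]", word, flags) is None

for word in ["babe", "bead", ""]:
    print(repr(word), full_match(word), no_outsiders(word))
# 'babe' True True
# 'bead' False False
# '' False True

print(full_match("Abe"), full_match("Abe", flags=re.IGNORECASE))
# False True
```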
That seems like an odd, backwards way to do it. It also takes almost twice as long as the more straightforward way:
$ time egrep "^[abe]+$" /usr/share/dict/words > /dev/null
real 0m0.031s
user 0m0.029s
sys 0m0.002s
$ time egrep -v "[^abe]" /usr/share/dict/words > /dev/null
real 0m0.056s
user 0m0.055s
sys 0m0.002s
posted by grouse at 7:58 AM on February 12, 2009
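For anyone who wants to repeat this kind of comparison outside the shell, here is a rough sketch using Python's timeit. It makes no claim about which approach is faster: Python's re engine is not egrep, the word list is a synthetic stand-in rather than a real dictionary, and the numbers will vary by machine.

```python
import re
import timeit

WHOLE_LINE = re.compile(r"[abe]+")
OUTSIDER = re.compile(r"[^abe]")

def by_whole_line(words):
    # keep words made up entirely of a, b, e
    return [w for w in words if WHOLE_LINE.fullmatch(w)]

def by_inversion(words):
    # 'if w' skips empty strings, which the inverted test would otherwise keep
    return [w for w in words if w and not OUTSIDER.search(w)]

# Synthetic stand-in for a word list; not real dictionary data.
words = ["babe", "bead", "ebb", "zebra", "abba", "cab"] * 5000

for fn in (by_whole_line, by_inversion):
    seconds = timeit.timeit(lambda: fn(words), number=10)
    print(f"{fn.__name__}: {seconds:.3f}s for 10 passes")
```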
You are correct, grouse. Thanks for pointing that out; I didn’t think of using the pattern to match the entire line.
posted by ijoshua at 11:04 AM on February 12, 2009
This thread is closed to new comments.
posted by lucidium at 5:47 PM on January 14, 2009