Word Frequency Counts
March 28, 2011 4:31 PM
Can I compute how frequently a word occurs in general English text? I have a list of about 2000 words, and I want to sort it with the most common words first.
I know the dumb way to do this: download some huge corpus and count count count count count. Two problems with that are: (1) most words on my list are pretty obscure so I expect to get a lot of zeros; (2) it'll be very slow. Counting Google hits is out because Google doesn't allow automated searches.
I'm hoping there's some kind of online database that I can query, like WordNet—except WordNet seems to have frequency counts only for extremely common words. Or if you have any other ideas, let me know. Thanks!
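For reference, a minimal sketch of the "dumb way" described above (count every word in some big plain-text corpus, then pull out your own list); corpus.txt and wordlist.txt are placeholder file names, not anything from this thread:

tr -cs 'A-Za-z' '\n' < corpus.txt |   # split the corpus into one word per line
  tr 'A-Z' 'a-z' |                    # lowercase everything
  sort | uniq -c | sort -rn > all-counts.txt
grep -wFf wordlist.txt all-counts.txt  # keep only the counts for words on your list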
Adam Kilgarriff's British National Corpus word frequency lists in various formats.
posted by qxntpqbbbqxl at 4:41 PM on March 28, 2011
I know there's really no such thing as "general English text." This doesn't need to be exact, so pretty much any kind of text would work, if there's enough of it. I think the BNC list is probably too small.
Google Ngrams looks great, but does it allow automated searching online?
posted by Chicken Boolean at 4:47 PM on March 28, 2011
The Google Ngram dataset is available for download. You probably only want the 1-grams.
posted by teraflop at 4:57 PM on March 28, 2011
I asked a similar question a while back. Maybe some of the resources there will be helpful to you.
posted by aparrish at 5:09 PM on March 28, 2011
The Google 1-grams are broken out year-by-year so you'd have to go through and get the sum and then sort those resulting numbers. For my own needs I've already done this, looking only at occurrences after 1950. If that suits your needs you can get it from me (36MB gzipped, 3,181,244 "words") at
http://dictionaut.com/google-1gram-frequencies-since-1950.txt.gz
The columns are all Google's except the last one which I calculated:
word | token_count | page_count | volume_count | freq_per_million
3 million lines should clue you in that there's a lot of noise and it's too big to open in Excel.
posted by xueexueg at 5:13 PM on March 28, 2011
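A minimal sketch of the kind of per-word aggregation described above, assuming the raw Google 1-gram files are tab-separated with columns word, year, match_count, page_count, volume_count (check your download; the layout may differ) and using a hypothetical input file name:

awk -F'\t' '$2 >= 1950 { tok[$1] += $3; pg[$1] += $4; vol[$1] += $5 }
  END { for (w in tok) print w "\t" tok[w] "\t" pg[w] "\t" vol[w] }' \
  googlebooks-eng-1gram.txt > frequencies-since-1950.txt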
I saw that Google Ngrams can be downloaded, but I'd prefer a data set that can be searched online, if one exists. Google's online viewer shows its results as an image, so I can't even try to search it with a script.
(I just downloaded Google's 1-grams; 2- or 3-grams would also be useful, but the 2-grams are about 25 GB and the 3-grams are impossibly huge.)
posted by Chicken Boolean at 5:15 PM on March 28, 2011
The Corpus of Contemporary American English (COCA) will do exactly what you want and more. If I wasn't under a massive time crunch to finish up an eerily similar task, I would join your list with one of their frequency lists (although I don't think they have a free one that is just texts) for you. If you can get a hold of a frequency list, a little SQL magic will extract the counts you need for your list.
I don't know how specific/accurate you want this to be, but there are quite a few problems with the Google Ngram data...mostly bad OCR. Also, do you want all varieties of English, or something more specific (American, British, etc.)?
posted by iamkimiam at 5:16 PM on March 28, 2011
Thanks xueexueg. And I'm checking out COCA right now. I want all varieties of English and I don't care much about accuracy. (So many answers so fast!)
posted by Chicken Boolean at 5:18 PM on March 28, 2011
Oh, and the companion site, Corpus of Historical American English (COHA)...it's texts!
posted by iamkimiam at 5:35 PM on March 28, 2011
If you're in some flavor of Unix you can query my list for your list fairly cheaply: if your wordlist has one word per line, you could do
cat wordlist | sed 's/^/\^/g;s/$/ /g' | grep -f - google-1gram-frequencies-since-1950.txt | sort -rnk5 > my-words-sorted-by-frequency.txt
posted by xueexueg at 6:28 PM on March 28, 2011
sed 's/^/\^/g;s/$/ /g' wordlist | grep -f - google-1gram-frequencies-since-1950.txt | sort -rnk5 > my-words-sorted-by-frequency.txt
is even cheaper :-)
posted by flabdablet at 10:15 PM on March 28, 2011
That is really beautiful, xueexueg and flabdablet!
I'd like to take that code and apply it to other files (I'm also working on corpus list comparisons, and many of my files are much too 'messy' and big for Excel). Would one of you (or any other onlookers) mind giving a quick translation of what the elements in the string mean (for ex. I think '|' means 'pipe', which means 'feed this into that')? That would be super helpful, thanks!
posted by iamkimiam at 11:35 PM on March 28, 2011
sed is the stream editor; it makes edits to each line of its input (in this case wordlist). s/FOO/BAR/g means search for an occurrence of the regular expression FOO and replace it with BAR; the /g on the end means do it for every occurrence of FOO in the line (its use here is redundant). Multiple s///g expressions can be given to sed, separated by semicolons, and they are executed in series. The regular expression ^ matches the beginning of a line, and the replacement text \^ means a literal caret (the \ is redundant here, because the replacement text is not a regular expression, but in general it quotes the next character to mean a literal value instead of a metacharacter). The regular expression $ matches the end of a line, and the replacement is a space -- actually this should be a TAB character (typable as Ctrl-V then TAB), but I don't think that was correctly conveyed in the comment. The combined result is that each entry in wordlist is prefixed by ^ and suffixed by a TAB, turning each word into a regular expression that matches that word in the corpus: "match this word at the beginning of a line, followed by a tab". Without the ^, 'cat' would match 'concatenate'; without the trailing tab, 'con' would match 'concatenate'.
That output is fed to grep, which is given the "-f -" option, meaning take a list of regular expressions from standard input -- in this case, the output of the previous command through the pipe. grep applies each of those regexes to each line of the input file and prints the lines that match. That output is sent to sort, which is given the options -r (reverse order), -n (numeric sort), and -k5, meaning use field 5 through to the end of the line as the sort key. (In general the -k option takes a START,END syntax, and omitting END means from START to the end of the line, which is usually not what you mean, so -k5,5 would be the preferred way of saying this; here the fifth field is the last field, so it doesn't matter.) In most *nix programs short options can be combined into one, so "-a -b -cFOO" is the same as "-abcFOO". The result is the matched corpus lines sorted in descending order of frequency, which gets redirected to the .txt file with the '>' operator.
posted by Rhomboid at 1:27 AM on March 29, 2011
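To make that walkthrough easier to follow, here is the same pipeline again, one stage per line; this is purely a restatement of the commands above (GNU sed assumed, so that \t in the replacement means a tab):

# prefix each word with ^ and suffix it with a tab, feed those regexes to grep
# to filter the corpus, then sort descending on field 5 (frequency per million)
sed 's/^/^/;s/$/\t/' wordlist |
  grep -f - google-1gram-frequencies-since-1950.txt |
  sort -rnk5 > my-words-sorted-by-frequency.txt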
The nice thing about little scripts like this is how easy it is to break them apart and see what the parts do. So if we start by making a little wordlist, like this:
cat <<. >wordlist
apple
banana
catalog
dormant
eagle
fruit
goose
hat
icicle
.
then we can check out what the first part does, like so:
sed 's/.*/^&\t/' wordlist
Note that I've simplified the sed script a little more. It still uses the s/pattern/replacement/ command to substitute replacement for pattern on every input line, but it now uses it only once. The pattern is .* which means "any character (.) any number of times (*)" or in other words "everything on the input line", and the replacement is a literal ^ character, followed by whatever was just matched (&) followed by a tab (\t). The effect is to add a ^ to the front of each input line and a tab to the end, and running the command lets us see that:
^apple
^banana
^catalog
^dormant
^eagle
^fruit
^goose
^hat
^icicle
We can also pipe that through
cat -t
to make sure that each word really is followed by a tab character:
sed 's/.*/^&\t/' wordlist | cat -t
displays
^apple^I
^banana^I
^catalog^I
^dormant^I
^eagle^I
^fruit^I
^goose^I
^hat^I
^icicle^I
So we've turned our word list into a list of regular expressions that match each of the words from the wordlist, at the start of a line, immediately followed by a tab character. Feeding that into grep ("global regular expression print") and running it against the great big google file, like this
sed 's/.*/^&\t/' wordlist | grep -f - google-1gram-frequencies-since-1950.txt
yields this:
apple 299124 220737 87122 7.697567020830
banana 120007 86497 35304 3.088224032404
catalog 189080 122189 43304 4.865727832934
dormant 98009 88805 53763 2.522134118775
eagle 152601 112901 49805 3.926988222094
fruit 1101184 790600 180224 28.337537751119
goose 96921 77156 44265 2.494135854114
hat 770770 613786 141106 19.834763284274
icicle 4372 3934 3331 0.112507732630
And when we run that through sort, asking it to sort in reverse order, numerically, on the fifth tab-delimited field, like this:
sed 's/.*/^&\t/' wordlist | grep -f - google-1gram-frequencies-since-1950.txt | sort -rnk5
we get this:
fruit 1101184 790600 180224 28.337537751119
hat 770770 613786 141106 19.834763284274
apple 299124 220737 87122 7.697567020830
catalog 189080 122189 43304 4.865727832934
eagle 152601 112901 49805 3.926988222094
banana 120007 86497 35304 3.088224032404
dormant 98009 88805 53763 2.522134118775
goose 96921 77156 44265 2.494135854114
icicle 4372 3934 3331 0.112507732630
which is exactly what will get written out to my-words-sorted-by-frequency.txt when we add the redirect to the end of our pipeline:
sed 's/.*/^&\t/' wordlist | grep -f - google-1gram-frequencies-since-1950.txt | sort -rnk5 > my-words-sorted-by-frequency.txt
The nasty thing about these little scripts is that they're quite cryptic, and there's very little consistency across shell commands for options they accept and what those options do. But if you issue the following commands
man cat
man grep
man sort
man sed
you'll get concise explanations of what cat, grep, sort and sed do, and their various options.
This stuff should work in any Unix (including Mac) or Linux environment without needing anything else installed; cat, grep, sort and sed are basic text-processing tools that are typically included with every Unix-like system. You can also install them into Windows in various ways; the most usable is probably Cygwin, which also gives you the bash shell that does all the pipelining and redirection stuff shown above.
posted by flabdablet at 3:11 AM on March 29, 2011
Oh, by the way: if all you want is the words sorted by frequency, and you don't want the actual frequencies, you can filter the output of sort even further before writing it to your output file:
sed 's/.*/^&\t/' wordlist | grep -f- google-1gram-frequencies-since-1950.txt | sort -rnk5 | cut -f1 > my-words-sorted-by-frequency.txt
will get you an output file containing only
fruit
hat
apple
catalog
eagle
banana
dormant
goose
icicle
posted by flabdablet at 3:17 AM on March 29, 2011
I'd also like to add that while this method is straightforward, it's not very fast if either the corpus or the wordlist is large. On my (admittedly very old) system, using flabdablet's example wordfile with 9 entries against the 3.2M-line corpus takes about 97 seconds with a hot cache. You can sort the corpus and take advantage of that in the lookup to great effect using a binary search:
$ tail -n +2 google-1gram-frequencies-since-1950.txt | LC_COLLATE=POSIX sort -k1,1 >google-1gram-frequencies-since-1950-sorted.txt
This creates a sorted version of the corpus using the POSIX collation rules. The tail command is simply to omit the first header line which shouldn't be included in the corpus. Specifying what collation order to use is important because every program must agree on the rules in order for this to work and POSIX (aka 'C') is a simple default that everything supports. Then you can do a lookup like this:
$ perl -MFile::SortedSeek=:all -ne 'BEGIN { open W, "<google-1gram-frequencies-since-1950-sorted.txt" or die; } chomp; alphabetic(*W, $_, sub {(split /\t/, shift)[0]}); print scalar <W> if File::SortedSeek::was_exact()' wordlist | sort -rnk5
With a hot cache on my (again, very slow) system this takes only about 67ms, or around 1400 times faster than applying a bunch of regexes to every line. Even if you include the time to do the sort (about 11 seconds), that's still at least 9 times faster, and you only have to sort once.
Now I cheated a little bit here and used the File::SortedSeek perl module, which is not installed by default and which you'd have to install before being able to run that command. For some people that might be a complete deal breaker, but it's really not that hard. As root (i.e. using sudo or su -c or however you prefer to do it on your system), run cpan File::SortedSeek and answer the questions; for most or all of them you can just hit enter to use the default. If that's too much hassle, check out cpanminus, an even simpler interface to CPAN that asks you no questions and just does the right thing.
posted by Rhomboid at 6:46 AM on March 29, 2011
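Not mentioned in the thread, but worth noting as a sketch: the standard join(1) tool can do the same sort-once-then-merge lookup without Perl, provided both files are sorted on the join field in the same collation. This just recombines commands already shown above, with xueexueg's file name:

tail -n +2 google-1gram-frequencies-since-1950.txt |
  LC_COLLATE=POSIX sort -k1,1 > corpus-sorted.txt
LC_COLLATE=POSIX sort wordlist > wordlist-sorted.txt
LC_COLLATE=POSIX join -t "$(printf '\t')" wordlist-sorted.txt corpus-sorted.txt |
  sort -rnk5 > my-words-sorted-by-frequency.txt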
Whoa, flabdablet and Rhomboid rock. Note that the first four Google columns are separated by a bunch of spaces, not tabs, so flabdablet's 's/.*/^&\t/' should have a literal space ' ' instead of the \t in order for the patterns to match on the particular file I've provided. So 's/.*/^& /'. The spaces are not too unusual for unix pipeable output — sort's column numbering considers *any* whitespace by default, space or tab — and I've muddied the waters a bit by appending my frequency column with a tab instead.
My excuse for using cat: in my actual command line I was doing head -200 /usr/share/dict/words to just grab a bunch of arbitrary words, then I swapped it out with cat to avoid introducing other typos ("...my privilege to waste processes!"). Chicken Boolean, head and tail show the first N lines of a file you tell them to look at, and /usr/share/dict/words is the OS X location of a big wordlist used by various Unix system/spelling utilities.
...though yes I could have used sed instead of head in the first place: sed 's/.*/^& /; 200q' wordlist...
posted by xueexueg at 6:49 AM on March 29, 2011
Oh, clearly I'm wrong about my file using only spaces: \t is correct for this input. I've been working from an even-messier local version that uses spaces, but I did make some token attempts to clean up the version I've been distributing. Though now that I check I see that I'm actually giving you [TAB][SPACE][SPACE][SPACE][SPACE][SPACE][SPACE][SPACE][SPACE] after the first column, so "clean" is not a word that applies to this either.
posted by xueexueg at 7:02 AM on March 29, 2011
It is actually pretty clean: there is exactly one tab between each pair of columns, and all the numeric columns are left-blank-padded to equal width.
posted by flabdablet at 5:41 PM on March 29, 2011
Playing with these tools a bit shows that grep (I'm using GNU grep) is indeed very slow for this job:
sed 's/.*/^&\t/' wordlist | time grep -f- google-1gram-frequencies-since-1950.txt
apple 299124 220737 87122 7.697567020830
banana 120007 86497 35304 3.088224032404
catalog 189080 122189 43304 4.865727832934
dormant 98009 88805 53763 2.522134118775
eagle 152601 112901 49805 3.926988222094
fruit 1101184 790600 180224 28.337537751119
goose 96921 77156 44265 2.494135854114
hat 770770 613786 141106 19.834763284274
icicle 4372 3934 3331 0.112507732630
109.63user 0.06system 1:49.69elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+283minor)pagefaults 0swaps
Even sed beats it hands down:
sed 's!.*!/^&\t/p!' wordlist | time sed -nf- google-1gram-frequencies-since-1950.txt
apple 299124 220737 87122 7.697567020830
banana 120007 86497 35304 3.088224032404
catalog 189080 122189 43304 4.865727832934
dormant 98009 88805 53763 2.522134118775
eagle 152601 112901 49805 3.926988222094
fruit 1101184 790600 180224 28.337537751119
goose 96921 77156 44265 2.494135854114
hat 770770 613786 141106 19.834763284274
icicle 4372 3934 3331 0.112507732630
18.53user 0.10system 0:18.64elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+263minor)pagefaults 0swaps
This version (with sed instead of grep) takes exactly the same approach as the original grep version: the first invocation of sed (the one applied to "wordlist") creates an intermediate script that's fed to the second, which does the actual filtering. The output of
sed 's!.*!/^&\t/p!' wordlist
where s!pattern!replacement! means "substitute replacement for pattern on all input lines", .* means "the entire input line", and /^&\t/p means "literal /^ followed by the matched pattern (&) followed by a tab (\t) and literal /p", is
/^apple /p
/^banana /p
/^catalog /p
/^dormant /p
/^eagle /p
/^fruit /p
/^goose /p
/^hat /p
/^icicle /p
This is itself a sed script, meaning "if the input line begins with 'apple ', print it; if it begins with 'banana ', print it; if it begins with 'catalog ', print it; ... if it begins with 'icicle ', print it." This script is piped from the first sed's standard output to the second sed's standard input, the second sed being invoked with the option -f-, which means "read the script from a file (-f) and specify standard input (-) as the file name". That second sed also has the -n option, which makes it not print any input line unless explicitly instructed to by its script. The net effect is to make the second sed output every line from the frequency file that starts with a word from the wordlist, which is exactly the same job grep is told to do in the first version; sed just does it five times faster.
Still not a patch on Rhomboid's perl solution, which is much more efficient because it uses a better algorithm: it runs a nice binary search against the corpus words instead of brute-force matching the entire input wordlist against all 3181243 of them in turn. The computer I'm using for my speed tests also takes about 9 seconds to do Rhomboid's pre-sort, for what that's worth.
posted by flabdablet at 7:50 PM on March 29, 2011
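Another way to avoid the per-line regex scan, not used in the thread but in the same spirit as the speed comparisons above, is to load the wordlist into an awk hash and make a single pass over the corpus with constant-time lookups; a sketch, again using xueexueg's file name:

awk -F'\t' 'NR==FNR { want[$1]; next } $1 in want' \
  wordlist google-1gram-frequencies-since-1950.txt |
  sort -rnk5 > my-words-sorted-by-frequency.txt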
...aaaand I've just found the look command:
look -b STRING FILE
does a binary search inside FILE for lines beginning with STRING, which is exactly what we need for this job. So let's build stepwise toward the commands we need, and see how it performs compared to the perl solution.
First thing is to clean up the corpus file a little. We want to get rid of the header in the first line, and it would also be nice to clean up the whitespace so that all that separates columns is a single tab character. Let's do our testing based on the first few lines of what we have to start with:
head google-1gram-frequencies-since-1950.txt
word token_count page_count volume_count freq_per_million
00000000000000000000000000000000 316 52 34 0.008131848927
000000000000000000000000000000 38 30 28 0.000977880567
0000000000000000000000000000 75 35 34 0.001930027435
000000000000000000000000000 44 43 32 0.001132282762
00000000000000000000000000 55 42 34 0.001415353453
0000000000000000000000000 66 49 47 0.001698424143
000000000000000000000000 86 76 74 0.002213098126
00000000000000000000000 84 55 47 0.002161630728
0000000000000000000000 92 82 78 0.002367500321
Run that through cat -t, so that tab characters show up as ^I:
head google-1gram-frequencies-since-1950.txt | cat -t
word token_count page_count volume_count freq_per_million
00000000000000000000000000000000^I 316^I52^I34^I0.008131848927
000000000000000000000000000000^I 38^I30^I28^I0.000977880567
0000000000000000000000000000^I 75^I35^I34^I0.001930027435
000000000000000000000000000^I 44^I43^I32^I0.001132282762
00000000000000000000000000^I 55^I42^I34^I0.001415353453
0000000000000000000000000^I 66^I49^I47^I0.001698424143
000000000000000000000000^I 86^I76^I74^I0.002213098126
00000000000000000000000^I 84^I55^I47^I0.002161630728
0000000000000000000000^I 92^I82^I78^I0.002367500321
So it looks like all the cleaning that's required is to strip the first line, then eliminate any spaces that follow the first tab. That's an easy job for sed:
head google-1gram-frequencies-since-1950.txt | sed '1d;s/\t */\t/' | cat -t
00000000000000000000000000000000^I316^I52^I34^I0.008131848927
000000000000000000000000000000^I38^I30^I28^I0.000977880567
0000000000000000000000000000^I75^I35^I34^I0.001930027435
000000000000000000000000000^I44^I43^I32^I0.001132282762
00000000000000000000000000^I55^I42^I34^I0.001415353453
0000000000000000000000000^I66^I49^I47^I0.001698424143
000000000000000000000000^I86^I76^I74^I0.002213098126
00000000000000000000000^I84^I55^I47^I0.002161630728
0000000000000000000000^I92^I82^I78^I0.002367500321
That seems to work. Why?
Look at the line noise following the sed command: '1d;s/\t */\t/'
It's wrapped in single quotes, which tells the bash shell to (a) leave everything inside the single quotes strictly as-is and (b) hand off the entire string so quoted as a single argument to the command it's invoking (sed). So what sed will see, and interpret as a command script, is exactly
1d;s/\t */\t/
That's two sed commands (the ; separates them). The first command - 1d - tells sed to delete the first line of input entirely; that gets rid of the unwanted header line. The second - s/\t */\t/ - tells sed to substitute (on every line, because there's no addressing prefix before the s) a single tab (\t) for the first occurrence of a single tab followed by any number of spaces (\t *).
OK. So, for a binary search to work properly, we're going to want a file sorted in POSIX collation order:
head google-1gram-frequencies-since-1950.txt | sed '1d;s/\t */\t/' | LC_COLLATE=POSIX sort | cat -t
0000000000000000000000^I92^I82^I78^I0.002367500321
00000000000000000000000^I84^I55^I47^I0.002161630728
000000000000000000000000^I86^I76^I74^I0.002213098126
0000000000000000000000000^I66^I49^I47^I0.001698424143
00000000000000000000000000^I55^I42^I34^I0.001415353453
000000000000000000000000000^I44^I43^I32^I0.001132282762
0000000000000000000000000000^I75^I35^I34^I0.001930027435
000000000000000000000000000000^I38^I30^I28^I0.000977880567
00000000000000000000000000000000^I316^I52^I34^I0.008131848927
Lovely.
This is the same technique as Rhomboid used, except that I haven't bothered telling sort to break input lines into separate key fields and sort on the first one; in POSIX collation order the tab character that delimits the first column sorts before any printable character, which means that simply sorting on entire lines will do the same job.
So now let's run the sed and the sort over the whole file and make a new output file:
sed '1d;s/\t */\t/' google-1gram-frequencies-since-1950.txt | LC_COLLATE=POSIX sort >cleaned.txt
25 seconds later on the dear old machine I'm using today, I now have a new, 106359200 byte file with the corpus cleaned as required. So it's time to run some look commands against that and see what happens.
look penguin cleaned.txt
penguin 13556 9309 4161 0.348846025509
penguin's 367 327 253 0.009444267583
penguinclassics 83 83 73 0.002135897028
penguinputnam 254 237 195 0.006536359581
penguins 18172 11653 4532 0.467632780728
Shit, that was quick! How quick, on this old cheap feeble computenbox?
time look penguin cleaned.txt
penguin 13556 9309 4161 0.348846025509
penguin's 367 327 253 0.009444267583
penguinclassics 83 83 73 0.002135897028
penguinputnam 254 237 195 0.006536359581
penguins 18172 11653 4532 0.467632780728
real 0m0.354s
user 0m0.272s
sys 0m0.084s
Huh. Not too shabby, but 354 milliseconds is still the wrong order of magnitude compared to Rhomboid's reported 67ms for ten lookups. Could it be that my old computer is ten times slower than his old computer?
But wait! I have forgotten the look command's -b option, and it's doing a linear string search! So what happens when I actually do the binary search that sorting the file made possible?
time look -b penguin cleaned.txt
penguin 13556 9309 4161 0.348846025509
penguin's 367 327 253 0.009444267583
penguinclassics 83 83 73 0.002135897028
penguinputnam 254 237 195 0.006536359581
penguins 18172 11653 4532 0.467632780728
real 0m0.003s
user 0m0.000s
sys 0m0.000s
Ho.
Ly.
Shit.
Now we have some cleaning up to do. First thing is to whack a tab character on the end of the search string so that look will only match it as a whole word:
time look -b penguin'\t' cleaned.txt
real 0m0.003s
user 0m0.000s
sys 0m0.004s
Huh! No search output. Looks like look doesn't interpret \t as tab. No matter; by using a handy $'bashism' instead of the bare single 'quotes', I can get bash to pre-interpret it:
look -b penguin$'\t' cleaned.txt
penguin 13556 9309 4161 0.348846025509
Better. Now let's apply look to each word in our wordlist, and time the whole thing for good measure:
time while read -r word; do look -b "$word"$'\t' cleaned.txt; done <wordlist
apple 299124 220737 87122 7.697567020830
banana 120007 86497 35304 3.088224032404
catalog 189080 122189 43304 4.865727832934
dormant 98009 88805 53763 2.522134118775
eagle 152601 112901 49805 3.926988222094
fruit 1101184 790600 180224 28.337537751119
goose 96921 77156 44265 2.494135854114
hat 770770 613786 141106 19.834763284274
icicle 4372 3934 3331 0.112507732630
real 0m0.024s
user 0m0.008s
sys 0m0.016s
Outstanding! And no extra tools required.
Get rid of the timer and pipe the result through sort and cut as we did yesterday, and we have
while read -r word; do look -b "$word"$'\t' cleaned.txt; done <wordlist | sort -rnk5 | cut -f1
fruit
hat
apple
catalog
eagle
banana
dormant
goose
icicle
pretty much instantly.
posted by flabdablet at 10:29 PM on March 30, 2011
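For anyone who just wants the end result, here is flabdablet's look-based recipe gathered into one block; these are the same commands shown step by step above (clean and sort the corpus once, then binary-search it for each word):

sed '1d;s/\t */\t/' google-1gram-frequencies-since-1950.txt |
  LC_COLLATE=POSIX sort > cleaned.txt
while read -r word; do
  look -b "$word"$'\t' cleaned.txt
done < wordlist | sort -rnk5 | cut -f1 > my-words-sorted-by-frequency.txt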
Wow, you guys put a lot of work into this! I wish I had seen it earlier. I used Python, rather than a shell script, with xueexueg's Google 1-grams (binary-searching like flabdablet suggests). My solution was obviously much longer than the one-liner above.
I had no idea about the "look" command (and I didn't know much about sed either), so this was educational. Thanks!
posted by Chicken Boolean at 10:51 AM on April 2, 2011
My solution was obviously much longer than the one-liner above
What do you expect, if you insist on coding in these low-level languages? :-)
posted by flabdablet at 6:48 PM on April 2, 2011
This thread is closed to new comments.