Word Frequency Counts
March 28, 2011 4:31 PM
Can I compute how frequently a word occurs in general English text? I have a list of about 2000 words, and I want to sort it with the most common words first.
I know the dumb way to do this: download some huge corpus and count count count count count. Two problems with that are: (1) most words on my list are pretty obscure so I expect to get a lot of zeros; (2) it'll be very slow. Counting Google hits is out because Google doesn't allow automated searches.
I'm hoping there's some kind of online database that I can query, like WordNet—except WordNet seems to have frequency counts only for extremely common words. Or if you have any other ideas, let me know. Thanks!
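For reference, a minimal sketch of the "dumb way" described above (count every word in some big plain-text corpus, then pull out your own list); corpus.txt and wordlist.txt are placeholder file names, not anything from this thread:

tr -cs 'A-Za-z' '\n' < corpus.txt |   # split the corpus into one word per line
  tr 'A-Z' 'a-z' |                    # lowercase everything
  sort | uniq -c | sort -rn > all-counts.txt
grep -wFf wordlist.txt all-counts.txt  # keep only the counts for words on your list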
Adam Kilgarriff's British National Corpus word frequency lists in various formats.
posted by qxntpqbbbqxl at 4:41 PM on March 28, 2011
I know there's really no such thing as "general English text." This doesn't need to be exact, so pretty much any kind of text would work, if there's enough of it. I think the BNC list is probably too small.
Google Ngrams looks great, but does it allow automated searching online?
posted by Chicken Boolean at 4:47 PM on March 28, 2011
The Google Ngram dataset is available for download. You probably only want the 1-grams.
posted by teraflop at 4:57 PM on March 28, 2011
I asked a similar question a while back. Maybe some of the resources there will be helpful to you.
posted by aparrish at 5:09 PM on March 28, 2011
The Google 1-grams are broken out year-by-year so you'd have to go through and get the sum and then sort those resulting numbers. For my own needs I've already done this, looking only at occurrences after 1950. If that suits your needs you can get it from me (36MB gzipped, 3,181,244 "words") at
http://dictionaut.com/google-1gram-frequencies-since-1950.txt.gz
The columns are all Google's except the last one which I calculated:
word | token_count | page_count | volume_count | freq_per_million
3 million lines should clue you in that there's a lot of noise and it's too big to open in Excel.
posted by xueexueg at 5:13 PM on March 28, 2011
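A minimal sketch of the kind of per-word aggregation described above, assuming the raw Google 1-gram files are tab-separated with columns word, year, match_count, page_count, volume_count (check your download; the layout may differ) and using a hypothetical input file name:

awk -F'\t' '$2 >= 1950 { tok[$1] += $3; pg[$1] += $4; vol[$1] += $5 }
  END { for (w in tok) print w "\t" tok[w] "\t" pg[w] "\t" vol[w] }' \
  googlebooks-eng-1gram.txt > frequencies-since-1950.txt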
I saw that Google Ngrams can be downloaded, but I'd prefer a data set that can be searched online, if one exists. Google's online viewer shows its results as an image, so I can't even try to search it with a script.
(I just downloaded Google's 1-grams; 2- or 3-grams would also be useful, but the 2-grams are about 25 GB and the 3-grams are impossibly huge.)
posted by Chicken Boolean at 5:15 PM on March 28, 2011
The Corpus of Contemporary American English (COCA) will do exactly what you want and more. If I wasn't under a massive time crunch to finish up an eerily similar task, I would join your list with one of their frequency lists (although I don't think they have a free one that is just texts) for you. If you can get a hold of a frequency list, a little SQL magic will extract the counts you need for your list.
I don't know how specific/accurate you want this to be, but there are quite a few problems with the Google Ngram data...mostly bad OCR. Also, do you want all varieties of English, or something more specific (American, British, etc.)?
posted by iamkimiam at 5:16 PM on March 28, 2011
Thanks xueexueg. And I'm checking out COCA right now. I want all varieties of English and I don't care much about accuracy. (So many answers so fast!)
posted by Chicken Boolean at 5:18 PM on March 28, 2011
Oh, and the companion site, Corpus of Historical American English (COHA)...it's texts!
posted by iamkimiam at 5:35 PM on March 28, 2011
If you're in some flavor of Unix you can query my list for your list fairly cheaply: if your wordlist has one word per line, you could do
cat wordlist | sed 's/^/\^/g;s/$/ /g' | grep -f - google-1gram-frequencies-since-1950.txt | sort -rnk5 > my-words-sorted-by-frequency.txt
posted by xueexueg at 6:28 PM on March 28, 2011
sed 's/^/\^/g;s/$/ /g' wordlist | grep -f - google-1gram-frequencies-since-1950.txt | sort -rnk5 > my-words-sorted-by-frequency.txt
is even cheaper :-)
posted by flabdablet at 10:15 PM on March 28, 2011
That is really beautiful, xueexueg and flabdablet!
I'd like to take that code and apply it to other files (I'm also working on corpus list comparisons, and many of my files are much too 'messy' and big for Excel). Would one of you (or any other onlookers) mind giving a quick translation of what the elements in the string mean (for ex. I think '|' means 'pipe', which means 'feed this into that')? That would be super helpful, thanks!
posted by iamkimiam at 11:35 PM on March 28, 2011
sed is the stream editor; it makes edits to each line of its input (in this case wordlist). s/FOO/BAR/g means search for an occurrence of the regular expression FOO and replace it with BAR; the /g on the end means do it for every occurrence of FOO in the line (its use here is redundant). Multiple s///g expressions can be given to sed, separated by semicolons, and they are executed in series. The regular expression ^ matches the beginning of a line, and the replacement text \^ means a literal caret (the \ is redundant here, because the replacement text is not a regular expression, but in general it quotes the next character to mean a literal value instead of a metacharacter). The regular expression $ matches the end of a line, and the replacement is a space -- actually this should be a TAB character (typable as Ctrl-V then TAB), but I don't think that was correctly conveyed in the comment. The combined result is that each entry in wordlist is prefixed by ^ and suffixed by a TAB, turning each word into a regular expression that matches that word in the corpus: "match this word at the beginning of a line, followed by a tab". Without the ^, 'cat' would match 'concatenate'; without the trailing tab, 'con' would match 'concatenate'.
That output is fed to grep, which is given the "-f -" option, meaning take a list of regular expressions from standard input -- in this case, the output of the previous command through the pipe. grep applies each of those regexes to each line of the input file and prints the lines that match. That output is sent to sort, which is given the options -r (reverse order), -n (numeric sort), and -k5, meaning use field 5 through to the end of the line as the sort key. (In general the -k option takes a START,END syntax, and omitting END means from START to the end of the line, which is usually not what you mean, so -k5,5 would be the preferred way of saying this; here the fifth field is the last field, so it doesn't matter.) In most *nix programs short options can be combined into one, so "-a -b -cFOO" is the same as "-abcFOO". The result is the matched corpus lines sorted in descending order of frequency, which gets redirected to the .txt file with the '>' operator.
posted by Rhomboid at 1:27 AM on March 29, 2011
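To make that walkthrough easier to follow, here is the same pipeline again, one stage per line; this is purely a restatement of the commands above (GNU sed assumed, so that \t in the replacement means a tab):

# prefix each word with ^ and suffix it with a tab, feed those regexes to grep
# to filter the corpus, then sort descending on field 5 (frequency per million)
sed 's/^/^/;s/$/\t/' wordlist |
  grep -f - google-1gram-frequencies-since-1950.txt |
  sort -rnk5 > my-words-sorted-by-frequency.txt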
The nice thing about little scripts like this is how easy it is to break them apart and see what the parts do. So if we start by making a little wordlist, like this:
cat <<. >wordlist
apple
banana
catalog
dormant
eagle
fruit
goose
hat
icicle
.
then we can check out what the first part does, like so:
sed 's/.*/^&\t/' wordlist
Note that I've simplified the sed script a little more. It still uses the s/pattern/replacement/ command to substitute replacement for pattern on every input line, but it now uses it only once. The pattern is .* which means "any character (.) any number of times (*)" or in other words "everything on the input line", and the replacement is a literal ^ character, followed by whatever was just matched (&) followed by a tab (\t). The effect is to add a ^ to the front of each input line and a tab to the end, and running the command lets us see that:
^apple
^banana
^catalog
^dormant
^eagle
^fruit
^goose
^hat
^icicle
We can also pipe that through
cat -t
to make sure that each word really is followed by a tab character:
sed 's/.*/^&\t/' wordlist | cat -t
displays
^apple^I
^banana^I
^catalog^I
^dormant^I
^eagle^I
^fruit^I
^goose^I
^hat^I
^icicle^I
So we've turned our word list into a list of regular expressions that match each of the words from the wordlist, at the start of a line, immediately followed by a tab character. Feeding that into grep ("global regular expression print") and running it against the great big google file, like this
sed 's/.*/^&\t/' wordlist | grep -f - google-1gram-frequencies-since-1950.txt
yields this:
apple 299124 220737 87122 7.697567020830
banana 120007 86497 35304 3.088224032404
catalog 189080 122189 43304 4.865727832934
dormant 98009 88805 53763 2.522134118775
eagle 152601 112901 49805 3.926988222094
fruit 1101184 790600 180224 28.337537751119
goose 96921 77156 44265 2.494135854114
hat 770770 613786 141106 19.834763284274
icicle 4372 3934 3331 0.112507732630
And when we run that through sort, asking it to sort in reverse order, numerically, on the fifth tab-delimited field, like this:
sed 's/.*/^&\t/' wordlist | grep -f - google-1gram-frequencies-since-1950.txt | sort -rnk5
we get this:
fruit 1101184 790600 180224 28.337537751119
hat 770770 613786 141106 19.834763284274
apple 299124 220737 87122 7.697567020830
catalog 189080 122189 43304 4.865727832934
eagle 152601 112901 49805 3.926988222094
banana 120007 86497 35304 3.088224032404
dormant 98009 88805 53763 2.522134118775
goose 96921 77156 44265 2.494135854114
icicle 4372 3934 3331 0.112507732630
which is exactly what will get written out to my-words-sorted-by-frequency.txt when we add the redirect to the end of our pipeline:
sed 's/.*/^&\t/' wordlist | grep -f - google-1gram-frequencies-since-1950.txt | sort -rnk5 > my-words-sorted-by-frequency.txt
The nasty thing about these little scripts is that they're quite cryptic, and there's very little consistency across shell commands for options they accept and what those options do. But if you issue the following commands
man cat
man grep
man sort
man sed
you'll get concise explanations of what cat, grep, sort and sed do, and their various options.
This stuff should work in any Unix (including Mac) or Linux environment without needing anything else installed; cat, grep, sort and sed are basic text-processing tools that are typically included with every Unix-like system. You can also install them into Windows in various ways; the most usable is probably Cygwin, which also gives you the bash shell that does all the pipelining and redirection stuff shown above.
posted by flabdablet at 3:11 AM on March 29, 2011
Oh, by the way: if all you want is the words sorted by frequency, and you don't want the actual frequencies, you can filter the output of sort even further before writing it to your output file:
sed 's/.*/^&\t/' wordlist | grep -f- google-1gram-frequencies-since-1950.txt | sort -rnk5 | cut -f1 > my-words-sorted-by-frequency.txt
will get you an output file containing only
fruit
hat
apple
catalog
eagle
banana
dormant
goose
icicle
posted by flabdablet at 3:17 AM on March 29, 2011
I'd also like to add that while this method is straightforward, it's not very fast if either the corpus or the wordlist is large. On my (admittedly very old) system, using flabdablet's example wordfile with 9 entries against the 3.2M-line corpus takes about 97 seconds with a hot cache. You can sort the corpus and take advantage of that in the lookup to great effect using a binary search:
$ tail -n +2 google-1gram-frequencies-since-1950.txt | LC_COLLATE=POSIX sort -k1,1 >google-1gram-frequencies-since-1950-sorted.txt
This creates a sorted version of the corpus using the POSIX collation rules. The tail command is simply to omit the first header line which shouldn't be included in the corpus. Specifying what collation order to use is important because every program must agree on the rules in order for this to work and POSIX (aka 'C') is a simple default that everything supports. Then you can do a lookup like this:
$ perl -MFile::SortedSeek=:all -ne 'BEGIN { open W, "<google-1gram-frequencies-since-1950-sorted.txt" or die; } chomp; alphabetic(*W, $_, sub {(split /\t/, shift)[0]}); print scalar <W> if File::SortedSeek::was_exact()' wordlist | sort -rnk5
With a hot cache on my (again, very slow) system this takes only about 67ms, or around 1400 times faster than applying a bunch of regexes to every line. Even if you include the time to do the sort (about 11 seconds), that's still at least 9 times faster, and you only have to sort once.
Now I cheated a little bit here and used the File::SortedSeek perl module, which is not installed by default and which you'd have to install before being able to run that command. For some people that might be a complete deal breaker, but it's really not that hard. As root (i.e. using sudo or su -c or however you prefer to do it on your system), run cpan File::SortedSeek and answer the questions; for most or all of them you can just hit enter to use the default. If that's too much hassle, check out cpanminus, an even simpler interface to CPAN that asks you no questions and just does the right thing.
posted by Rhomboid at 6:46 AM on March 29, 2011
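Not mentioned in the thread, but worth noting as a sketch: the standard join(1) tool can do the same sort-once-then-merge lookup without Perl, provided both files are sorted on the join field in the same collation. This just recombines commands already shown above, with xueexueg's file name:

tail -n +2 google-1gram-frequencies-since-1950.txt |
  LC_COLLATE=POSIX sort -k1,1 > corpus-sorted.txt
LC_COLLATE=POSIX sort wordlist > wordlist-sorted.txt
LC_COLLATE=POSIX join -t "$(printf '\t')" wordlist-sorted.txt corpus-sorted.txt |
  sort -rnk5 > my-words-sorted-by-frequency.txt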
Whoa, flabdablet and Rhomboid rock. Note that the first four Google columns are separated by a bunch of spaces, not tabs, so flabdablet's 's/.*/^&\t/' should have a literal space ' ' instead of the \t in order for the patterns to match on the particular file I've provided. So 's/.*/^& /'. The spaces are not too unusual for unix pipeable output — sort's column numbering considers *any* whitespace by default, space or tab — and I've muddied the waters a bit by appending my frequency column with a tab instead.
My excuse for using cat: in my actual command line I was doing head -200 /usr/share/dict/words to just grab a bunch of arbitrary words, then I swapped it out with cat to avoid introducing other typos ("...my privilege to waste processes!"). Chicken Boolean, head and tail show the first N lines of a file you tell them to look at, and /usr/share/dict/words is the OS X location of a big wordlist used by various Unix system/spelling utilities.
...though yes I could have used sed instead of head in the first place: sed 's/.*/^& /; 200q' wordlist...
posted by xueexueg at 6:49 AM on March 29, 2011
Oh, clearly I'm wrong about my file using only spaces: \t is correct for this input. I've been working from an even-messier local version that uses spaces, but I did make some token attempts to clean up the version I've been distributing. Though now that I check I see that I'm actually giving you [TAB][SPACE][SPACE][SPACE][SPACE][SPACE][SPACE][SPACE][SPACE] after the first column, so "clean" is not a word that applies to this either.
posted by xueexueg at 7:02 AM on March 29, 2011
It is actually pretty clean: there is exactly one tab between each pair of columns, and all the numeric columns are left-blank-padded to equal width.
posted by flabdablet at 5:41 PM on March 29, 2011
Playing with these tools a bit shows that grep (I'm using GNU grep) is indeed very slow for this job:
sed 's/.*/^&\t/' wordlist | time grep -f- google-1gram-frequencies-since-1950.txt
apple 299124 220737 87122 7.697567020830
banana 120007 86497 35304 3.088224032404
catalog 189080 122189 43304 4.865727832934
dormant 98009 88805 53763 2.522134118775
eagle 152601 112901 49805 3.926988222094
fruit 1101184 790600 180224 28.337537751119
goose 96921 77156 44265 2.494135854114
hat 770770 613786 141106 19.834763284274
icicle 4372 3934 3331 0.112507732630
109.63user 0.06system 1:49.69elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+283minor)pagefaults 0swaps
Even sed beats it hands down:
sed 's!.*!/^&\t/p!' wordlist | time sed -nf- google-1gram-frequencies-since-1950.txt
apple 299124 220737 87122 7.697567020830
banana 120007 86497 35304 3.088224032404
catalog 189080 122189 43304 4.865727832934
dormant 98009 88805 53763 2.522134118775
eagle 152601 112901 49805 3.926988222094
fruit 1101184 790600 180224 28.337537751119
goose 96921 77156 44265 2.494135854114
hat 770770 613786 141106 19.834763284274
icicle 4372 3934 3331 0.112507732630
18.53user 0.10system 0:18.64elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+263minor)pagefaults 0swaps
This version (with sed instead of grep) takes exactly the same approach as the original grep version: the first invocation of sed (the one applied to "wordlist") creates an intermediate script that's fed to the second, which does the actual filtering. The output of
sed 's!.*!/^&\t/p!' wordlist
where s!pattern!replacement! means "substitute replacement for pattern on all input lines", .* means "the entire input line", and /^&\t/p means "literal /^ followed by the matched pattern (&) followed by a tab (\t) and literal /p", is
/^apple /p
/^banana /p
/^catalog /p
/^dormant /p
/^eagle /p
/^fruit /p
/^goose /p
/^hat /p
/^icicle /p
This is itself a sed script, meaning "if the input line begins with 'apple ', print it; if it begins with 'banana ', print it; if it begins with 'catalog ', print it; ... if it begins with 'icicle ', print it." This script is piped from the first sed's standard output to the second sed's standard input, the second sed being invoked with the option -f-, which means "read the script from a file (-f) and specify standard input (-) as the file name". That second sed also has the -n option, which makes it not print any input line unless explicitly instructed to by its script. The net effect is to make the second sed output every line from the frequency file that starts with a word from the wordlist, which is exactly the same job grep is told to do in the first version; sed just does it five times faster.
Still not a patch on Rhomboid's perl solution, which is much more efficient because it uses a better algorithm: it runs a nice binary search against the corpus words instead of brute-force matching the entire input wordlist against all 3181243 of them in turn. The computer I'm using for my speed tests also takes about 9 seconds to do Rhomboid's pre-sort, for what that's worth.
posted by flabdablet at 7:50 PM on March 29, 2011
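Another way to avoid the per-line regex scan, not used in the thread but in the same spirit as the speed comparisons above, is to load the wordlist into an awk hash and make a single pass over the corpus with constant-time lookups; a sketch, again using xueexueg's file name:

awk -F'\t' 'NR==FNR { want[$1]; next } $1 in want' \
  wordlist google-1gram-frequencies-since-1950.txt |
  sort -rnk5 > my-words-sorted-by-frequency.txt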
...aaaand I've just found the look command:
look -b STRING FILE
does a binary search inside FILE for lines beginning with STRING, which is exactly what we need for this job. So let's build stepwise toward the commands we need, and see how it performs compared to the perl solution.
First thing is to clean up the corpus file a little. We want to get rid of the header in the first line, and it would also be nice to clean up the whitespace so that all that separates columns is a single tab character. Let's do our testing based on the first few lines of what we have to start with:
head google-1gram-frequencies-since-1950.txt
word token_count page_count volume_count freq_per_million
00000000000000000000000000000000 316 52 34 0.008131848927
000000000000000000000000000000 38 30 28 0.000977880567
0000000000000000000000000000 75 35 34 0.001930027435
000000000000000000000000000 44 43 32 0.001132282762
00000000000000000000000000 55 42 34 0.001415353453
0000000000000000000000000 66 49 47 0.001698424143
000000000000000000000000 86 76 74 0.002213098126
00000000000000000000000 84 55 47 0.002161630728
0000000000000000000000 92 82 78 0.002367500321
Run that through cat -t, so that tab characters show up as ^I:
head google-1gram-frequencies-since-1950.txt | cat -t
word token_count page_count volume_count freq_per_million
00000000000000000000000000000000^I 316^I52^I34^I0.008131848927
000000000000000000000000000000^I 38^I30^I28^I0.000977880567
0000000000000000000000000000^I 75^I35^I34^I0.001930027435
000000000000000000000000000^I 44^I43^I32^I0.001132282762
00000000000000000000000000^I 55^I42^I34^I0.001415353453
0000000000000000000000000^I 66^I49^I47^I0.001698424143
000000000000000000000000^I 86^I76^I74^I0.002213098126
00000000000000000000000^I 84^I55^I47^I0.002161630728
0000000000000000000000^I 92^I82^I78^I0.002367500321
So it looks like all the cleaning that's required is to strip the first line, then eliminate any spaces that follow the first tab. That's an easy job for sed:
head google-1gram-frequencies-since-1950.txt | sed '1d;s/\t */\t/' | cat -t
00000000000000000000000000000000^I316^I52^I34^I0.008131848927
000000000000000000000000000000^I38^I30^I28^I0.000977880567
0000000000000000000000000000^I75^I35^I34^I0.001930027435
000000000000000000000000000^I44^I43^I32^I0.001132282762
00000000000000000000000000^I55^I42^I34^I0.001415353453
0000000000000000000000000^I66^I49^I47^I0.001698424143
000000000000000000000000^I86^I76^I74^I0.002213098126
00000000000000000000000^I84^I55^I47^I0.002161630728
0000000000000000000000^I92^I82^I78^I0.002367500321
That seems to work. Why?
Look at the line noise following the sed command: '1d;s/\t */\t/'
It's wrapped in single quotes, which tells the bash shell to (a) leave everything inside the single quotes strictly as-is and (b) hand off the entire string so quoted as a single argument to the command it's invoking (sed). So what sed will see, and interpret as a command script, is exactly
1d;s/\t */\t/
That's two sed commands (the ; separates them). The first command - 1d - tells sed to delete the first line of input entirely; that gets rid of the unwanted header line. The second - s/\t */\t/ - tells sed to substitute (on every line, because there's no addressing prefix before the s) a single tab (\t) for the first occurrence of a single tab followed by any number of spaces (\t *).
OK. So, for a binary search to work properly, we're going to want a file sorted in POSIX collation order:
head google-1gram-frequencies-since-1950.txt | sed '1d;s/\t */\t/' | LC_COLLATE=POSIX sort | cat -t
0000000000000000000000^I92^I82^I78^I0.002367500321
00000000000000000000000^I84^I55^I47^I0.002161630728
000000000000000000000000^I86^I76^I74^I0.002213098126
0000000000000000000000000^I66^I49^I47^I0.001698424143
00000000000000000000000000^I55^I42^I34^I0.001415353453
000000000000000000000000000^I44^I43^I32^I0.001132282762
0000000000000000000000000000^I75^I35^I34^I0.001930027435
000000000000000000000000000000^I38^I30^I28^I0.000977880567
00000000000000000000000000000000^I316^I52^I34^I0.008131848927
Lovely.
This is the same technique as Rhomboid used, except that I haven't bothered telling sort to break input lines into separate key fields and sort on the first one; in POSIX collation order the tab character that delimits the first column sorts before any printable character, which means that simply sorting on entire lines will do the same job.
So now let's run the sed and the sort over the whole file and make a new output file:
sed '1d;s/\t */\t/' google-1gram-frequencies-since-1950.txt | LC_COLLATE=POSIX sort >cleaned.txt
25 seconds later on the dear old machine I'm using today, I now have a new, 106359200 byte file with the corpus cleaned as required. So it's time to run some look commands against that and see what happens.
look penguin cleaned.txt
penguin 13556 9309 4161 0.348846025509
penguin's 367 327 253 0.009444267583
penguinclassics 83 83 73 0.002135897028
penguinputnam 254 237 195 0.006536359581
penguins 18172 11653 4532 0.467632780728
Shit, that was quick! How quick, on this old cheap feeble computenbox?
time look penguin cleaned.txt
penguin 13556 9309 4161 0.348846025509
penguin's 367 327 253 0.009444267583
penguinclassics 83 83 73 0.002135897028
penguinputnam 254 237 195 0.006536359581
penguins 18172 11653 4532 0.467632780728
real 0m0.354s
user 0m0.272s
sys 0m0.084s
Huh. Not too shabby, but 354 milliseconds is still the wrong order of magnitude compared to Rhomboid's reported 67ms for ten lookups. Could it be that my old computer is ten times slower than his old computer?
But wait! I have forgotten the look command's -b option, and it's doing a linear string search! So what happens when I actually do the binary search that sorting the file made possible?
time look -b penguin cleaned.txt
penguin 13556 9309 4161 0.348846025509
penguin's 367 327 253 0.009444267583
penguinclassics 83 83 73 0.002135897028
penguinputnam 254 237 195 0.006536359581
penguins 18172 11653 4532 0.467632780728
real 0m0.003s
user 0m0.000s
sys 0m0.000s
Ho.
Ly.
Shit.
Now we have some cleaning up to do. First thing is to whack a tab character on the end of the search string so that look will only match it as a whole word:
time look -b penguin'\t' cleaned.txt
real 0m0.003s
user 0m0.000s
sys 0m0.004s
Huh! No search output. Looks like look doesn't interpret \t as tab. No matter; by using a handy $'bashism' instead of the bare single 'quotes', I can get bash to pre-interpret it:
look -b penguin$'\t' cleaned.txt
penguin 13556 9309 4161 0.348846025509
Better. Now let's apply look to each word in our wordlist, and time the whole thing for good measure:
time while read -r word; do look -b "$word"$'\t' cleaned.txt; done <wordlist
apple 299124 220737 87122 7.697567020830
banana 120007 86497 35304 3.088224032404
catalog 189080 122189 43304 4.865727832934
dormant 98009 88805 53763 2.522134118775
eagle 152601 112901 49805 3.926988222094
fruit 1101184 790600 180224 28.337537751119
goose 96921 77156 44265 2.494135854114
hat 770770 613786 141106 19.834763284274
icicle 4372 3934 3331 0.112507732630
real 0m0.024s
user 0m0.008s
sys 0m0.016s
Outstanding! And no extra tools required.
Get rid of the timer and pipe the result through sort and cut as we did yesterday, and we have
while read -r word; do look -b "$word"$'\t' cleaned.txt; done <wordlist | sort -rnk5 | cut -f1
fruit
hat
apple
catalog
eagle
banana
dormant
goose
icicle
pretty much instantly.
posted by flabdablet at 10:29 PM on March 30, 2011
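For anyone who just wants the end result, here is flabdablet's look-based recipe gathered into one block; these are the same commands shown step by step above (clean and sort the corpus once, then binary-search it for each word):

sed '1d;s/\t */\t/' google-1gram-frequencies-since-1950.txt |
  LC_COLLATE=POSIX sort > cleaned.txt
while read -r word; do
  look -b "$word"$'\t' cleaned.txt
done < wordlist | sort -rnk5 | cut -f1 > my-words-sorted-by-frequency.txt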
Wow, you guys put a lot of work into this! I wish I had seen it earlier. I used Python, rather than a shell script, with xueexueg's Google 1-grams (binary-searching like flabdablet suggests). My solution was obviously much longer than the one-liner above.
I had no idea about the "look" command (and I didn't know much about sed either), so this was educational. Thanks!
posted by Chicken Boolean at 10:51 AM on April 2, 2011
My solution was obviously much longer than the one-liner above
What do you expect, if you insist on coding in these low-level languages? :-)
posted by flabdablet at 6:48 PM on April 2, 2011
This thread is closed to new comments.