Tools for Word Frequency Analysis
May 6, 2007 2:52 PM

Is there a recent American Word Corpus available for free?

I'm playing with a little word frequency analysis just for fun, and I've found, somewhat unsurprisingly, that the British National Corpus isn't the best basis for comparative analysis of modern American English; it also cuts off at 20 words per million. The Brown Corpus is better but not available, and the American National Corpus costs $75. The 1st edition of the ANC is OK and is as recent as 2003, but it would be awesome if someone knew of a corpus prepared each year, so I could look at the increase in word usage over time.

Also, I'm using the word counter from Catherine Ball at Georgetown, which is fine, but what I'd like to do is pick out statistically significant words. If there were a word counter that used a corpus on the back end to produce an "expectation value", showing the frequency of each word in the submitted text relative to its frequency in the corpus, that would also rock. I could probably even make one if none exists, had I an appropriate corpus.

Finally, I'd like to hear about great ideas and tools for visualization of this kind of data.
Here's a previous related question.
posted by Mr. Gunn to Media & Arts (4 answers total)
 
If you want to see an example of something I'm doing as an exercise, check my profile for a link.
posted by Mr. Gunn at 3:32 PM on May 6, 2007


"The Brown Corpus is better, but not available"

It is; see my answer to the question you linked to.

I don't know of any free yearly corpora, but if you don't need an annotated corpus anyway, you could just collect texts from a newspaper website or a similar source.

Here's a calculator that compares frequencies between corpora; this paper explains how it works. This describes a word-of-the-day system and might also be useful.
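
For anyone who wants to try this kind of comparison locally, here is a rough sketch. It assumes the log-likelihood (G2) statistic, which is a common choice for comparing word frequencies between two corpora; I'm not claiming it is exactly what the calculator above does. The function name keyness and the "count word" input format (what sort | uniq -c produces) are my own choices for illustration.

```shell
# keyness SAMPLE_COUNTS REFERENCE_COUNTS
# Hypothetical helper: both inputs are "count word" lists, e.g. from `sort | uniq -c`.
# Output columns (tab-separated): word, sample count, reference count, G2, direction.
keyness() (
    export LC_ALL=C   # keep number formatting and sorting predictable
    awk '
        # First file holds the sample counts, second file the reference counts.
        FNR == NR { a[$2] = $1; c += $1; next }
                  { b[$2] = $1; d += $1 }
        END {
            for (w in a) {
                bw = (w in b) ? b[w] : 0
                e1 = c * (a[w] + bw) / (c + d)   # expected count in the sample
                e2 = d * (a[w] + bw) / (c + d)   # expected count in the reference
                g2 = a[w] * log(a[w] / e1)
                if (bw > 0) g2 += bw * log(bw / e2)   # a zero count contributes nothing
                g2 *= 2
                # "+" means relatively more frequent in the sample, "-" less frequent.
                dir = (a[w] / c >= bw / d) ? "+" : "-"
                printf "%s\t%d\t%d\t%.2f\t%s\n", w, a[w], bw, g2, dir
            }
        }' "$1" "$2" | sort -t "$(printf '\t')" -k4,4nr
)
```

Called as keyness my_text_counts reference_counts (both file names are placeholders), it lists the words of the sample with the largest G2 first. Only words that occur in the sample are listed, and larger values mean the difference from the reference corpus is less likely to be chance.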

Tag clouds are often used when it comes to visualizing these things.
posted by snownoid at 3:59 PM on May 6, 2007


On a Linux machine, the bash command to create a word frequency list (a basic concordance) would be:

tr -sc 'A-Za-z' '\n' < corpus | sort | uniq -c | sort -nr > sorted_concordance


Then you can count the total number of words:

wc -w < corpus
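
If you want per-million figures that can be compared with a published corpus list, the two steps above can be combined. This is only a sketch: the function name freq_per_million is made up for illustration, it folds everything to lowercase, and it treats only ASCII letters as word characters.

```shell
# freq_per_million FILE
# Hypothetical helper: prints "word<TAB>occurrences per million words" for FILE,
# most frequent words first.
freq_per_million() (
    export LC_ALL=C   # ASCII-only handling, predictable sorting
    # Lowercase, put one word per line, drop empty lines, then count.
    tr 'A-Z' 'a-z' < "$1" | tr -sc 'a-z' '\n' | grep -v '^$' |
        sort | uniq -c |
        awk '{ count[$2] = $1; total += $1 }
             END {
                 for (w in count)
                     printf "%s\t%.1f\n", w, count[w] * 1000000 / total
             }' |
        sort -t "$(printf '\t')" -k2,2nr -k1,1
)
```

For example, freq_per_million corpus > per_million.tsv writes the scaled list to a file. The total here is the number of letter-only tokens, so it can differ slightly from what wc -w reports for the same file.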


Hope that helps.
posted by Freen at 4:03 PM on May 6, 2007


The MRC Psycholinguistic Database contains the data from the Brown and London-Lund corpora. Of course, neither of these is particularly recent.
posted by miagaille at 6:36 PM on May 6, 2007

