I'm playing with a little word frequency analysis just for fun, and I've found, somewhat unsurprisingly, that the British National Corpus isn't the best for comparative analysis of modern American English, and it cuts off at 20 words per million. The Brown Corpus is better but not available, the American National Corpus costs $75, and the 1st edition of the ANC is OK and as recent as 2003. Still, it would be awesome if someone knew of a corpus prepared each year, so I could look at the increase in word usage over time.
Also, I'm using the word counter from Catherine Ball at Georgetown, which is fine, but what I'm looking at doing is picking out statistically significant words. If there were a word counter that used a corpus on the back end to produce an "expectation value" showing the frequency of a word in submitted text relative to its frequency in the corpus, that would also rock. I could probably even make one if none exists, had I an appropriate corpus.
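For what it's worth, here is a rough sketch of what I mean by an "expectation value", assuming the corpus is available as plain word counts. The names `tokenize` and `score_words` are made up for illustration, the tokenizer is deliberately crude, and the log-likelihood (G2) score is just one common way to rank words by how surprising their counts are, not necessarily the right one for this job.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def score_words(text, ref_counts, ref_total):
    """Compare word counts in `text` against a reference corpus.

    ref_counts: dict mapping word -> count in the reference corpus
    ref_total:  total number of words in the reference corpus
    Returns (word, observed, expected, g2) tuples, most surprising first.
    """
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    rows = []
    for word, observed in counts.items():
        ref = ref_counts.get(word, 0)
        # Count we would expect in the text if it matched the corpus rate.
        expected = total * ref / ref_total
        # Log-likelihood: expected counts under a shared rate for both samples.
        pooled = (observed + ref) / (total + ref_total)
        e_text = total * pooled
        e_ref = ref_total * pooled
        g2 = 2 * observed * math.log(observed / e_text)
        if ref > 0:
            g2 += 2 * ref * math.log(ref / e_ref)
        rows.append((word, observed, expected, g2))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    # Toy reference counts, invented for the example; not from any real corpus.
    toy_ref = {"the": 60000, "of": 30000, "corpus": 10}
    for row in score_words("the corpus of the corpus corpus", toy_ref, 1_000_000):
        print(row)
```

A high G2 only says the word's rate differs from the corpus, so comparing `observed` with `expected` is still needed to tell overused words from underused ones.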
Finally, I'd like to hear about great ideas and tools for visualization of this kind of data.
Here's a previous related question.
posted by Mr. Gunn at 3:32 PM on May 6, 2007