American English word frequencies?
April 9, 2007 9:38 PM

Where can I find a good word frequency list for American English?

Requirements: Not just sorted by frequency, but with specific frequency information for each word. Not lemmatized. These lists are almost perfect, except they're based on British English (and they have separate entries for "n't" and "'s").

Does such a list even exist? Is this the kind of thing I can't get online? Google has revealed to me only other lists based on the British National Corpus and a lot of word lists for open source spell checkers and Scrabble programs. Am I missing something?

(Suggestions for other interesting corpora that I could use to generate my own list of this sort would also be appreciated—this is for a generative poetry project.)
posted by aparrish to Writing & Language (12 answers total)
 
I suggest training your generator on a specific poet and seeing if you can get it to emulate that style. Then you can add in other base text to get more breadth.

For example, generate frequencies from http://www.infomotions.com/etexts/literature/english/1600-1699/shakespeare-sonnets-59.txt
posted by demiurge at 10:01 PM on April 9, 2007
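
A minimal sketch of the do-it-yourself route suggested above, assuming the source text has already been saved or loaded as a plain string. The function name `word_frequencies` and the tokenizing pattern are illustrative choices, not anything from the thread. It does no lemmatizing, keeps contractions such as "don't" as single entries instead of splitting off "n't" and "'s", and reports a raw count plus a per-million rate for each word.

```python
import re
from collections import Counter

# Letters, optionally followed by apostrophe + letters, so "don't" and
# "poet's" stay whole instead of splitting into "n't" and "'s".
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def word_frequencies(text):
    """Return (word, count, per_million) tuples, most frequent first."""
    # Fold case and treat typographic apostrophes like plain ones.
    normalized = text.lower().replace("\u2019", "'")
    tokens = TOKEN_RE.findall(normalized)
    counts = Counter(tokens)
    total = len(tokens)
    return [
        (word, count, count * 1_000_000 / total)
        for word, count in counts.most_common()
    ]


sample = (
    "Shall I compare thee to a summer's day? "
    "Thou art more lovely; thou art more temperate."
)
for word, count, per_million in word_frequencies(sample)[:3]:
    print(f"{word}\t{count}\t{per_million:.0f}")
```

Inflected forms stay separate here ("stop" and "stops" get their own rows), which matches the "not lemmatized" requirement. Mixing several base texts, as suggested, would just mean concatenating them before counting.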


The Brown Corpus was what they referred us to in a computational linguistics course I took ~5 years ago. It seems like there may be copyright issues that explain why it is not more widely available, but the first page of Google results includes this plaintext version.

The LDC at Penn may be of help.
posted by epugachev at 10:16 PM on April 9, 2007


Oh, of course if you generate frequencies yourself, you'll probably need a part-of-speech tagger, which could be extra work for you.

I'm not sure I understand what's wrong with the lists you linked to. Do you just not want British English?
posted by demiurge at 10:21 PM on April 9, 2007


Best answer: The first release of the American National Corpus is available for $75 for non-commercial use.
posted by lukemeister at 10:28 PM on April 9, 2007


The MRC Psycholinguistic Database has helped me before for psychology projects. It runs a Unix dict command on the database based on parameters you set, and it includes Brown Frequency. Take a look at the parameters: to get a list of the most popular words, set the minimum for BROWN-FREQ to about 50 or so and you'll get a large enough list. You'll have to sort the list in Excel, but it'll cut and paste easily.
posted by ALongDecember at 10:29 PM on April 9, 2007


wordcount?

i think they have an API as well, if you are technically-minded.
posted by UbuRoivas at 11:15 PM on April 9, 2007


oh, cancel that.

WordCount data currently comes from the British National Corpus.
posted by UbuRoivas at 11:17 PM on April 9, 2007


Do the wiktionary:Frequency Lists help?
posted by Citizen Premier at 12:44 AM on April 10, 2007


Seconding the MRC database. However, please note that the Brown-Freq it gives is NOT THE SAME THING as the Brown Corpus, which is probably what you want. The labels are by editor, so the data for the Brown Corpus is under Kucera-Francis Frequency (edited by Francis and Kucera, 1967). "Brown" here refers to the London-Lund corpus (edited by Brown, 1984). The LLC is British English; the Brown Corpus is American.
posted by miagaille at 7:05 AM on April 10, 2007


Best answer: The nltk-lite corpora package includes a pos-tagged version of the Brown corpus.
posted by snownoid at 9:07 AM on April 10, 2007
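
For reference, a hedged sketch of pulling raw counts out of the Brown corpus with present-day NLTK, the successor to nltk-lite (the 2007 package's interface may well have differed). `nltk.corpus.brown` and `nltk.download` are real NLTK names; `load_brown_words` and `count_words` are hypothetical helpers made up for this example. The NLTK import sits inside the loader so the counting part works on any list of tokens.

```python
from collections import Counter


def load_brown_words():
    """Return the Brown corpus tokens; assumes NLTK is installed."""
    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)  # fetches the corpus if it is missing
    return brown.words()


def count_words(words):
    """Count case-folded tokens, skipping pure punctuation and numbers."""
    return Counter(
        w.lower() for w in words if any(ch.isalpha() for ch in w)
    )


# Usage, once NLTK and the corpus are available:
#     counts = count_words(load_brown_words())
#     for word, count in counts.most_common(10):
#         print(word, count)
```

Case-folding is a design choice: it merges "The" and "the", but it also merges proper nouns with common words, so drop the `.lower()` call if that distinction matters.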


Response by poster: Thanks for the leads, everyone!
posted by aparrish at 7:43 PM on April 12, 2007


The Brown corpus, be it noted, is from the 1960s, and I think it's mostly business/light-literary English. If that's what you want to generate, OK, but the vocabulary might not be very interesting for poetry.

For creative purposes, I think you'd be OK using British sources. If it really had to taste American, you could post-process it for spelling and have/had gotten and so forth. Yes, there are still differences of frequency, but I'm not sure anyone is going to stand over your generator long enough to establish for sure that its ratio of sentence-initial so to sentence-final then is too low for red-blooded American poetry.

And for what it's worth, I once used Variation In English Words (a web interface to the British National Corpus) to search for words that were exceptionally more common in one register than another. For example, I took the adjectives that are relatively prevalent in fiction as contrasted with news; the top five hits were faint, silken, husky, rueful, and momentary—now just try and tell me that isn't Distilled Essence of Bodice-Ripper. I can see those sorts of lists helping you create a near-parodic, over-the-top feeling—and that, for me, is what makes generated art work.

Or you could slurp the collected works of Emily Dickinson from Project Gutenberg, and Perl up the frequency lists yourself. Possibilities ...
posted by eritain at 9:23 PM on April 17, 2007
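
The register comparison described above can be approximated with a ratio of relative frequencies. This is a sketch of that general idea, not how the Variation In English Words interface actually computes its results; the function name `overrepresented`, the add-one smoothing, and the `min_count` cutoff are all assumptions, and the two token lists are toy data.

```python
from collections import Counter


def overrepresented(target_tokens, reference_tokens, min_count=2, top=5):
    """Rank words by how much more frequent they are in target than reference."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    # Add-one smoothing so words absent from the reference do not divide by zero.
    vocab_size = len(set(target) | set(reference))
    target_total = sum(target.values()) + vocab_size
    reference_total = sum(reference.values()) + vocab_size

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # rare words give unstable ratios
        target_rate = (count + 1) / target_total
        reference_rate = (reference[word] + 1) / reference_total
        scores[word] = target_rate / reference_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


fiction = "her husky voice was faint and rueful a faint husky whisper".split()
news = "the minister said the vote was close and the voice vote failed".split()
print(overrepresented(fiction, news))
```

With real corpora the token lists would come from two differently sourced texts (say, novels versus newspaper articles), and a higher `min_count` keeps one-off words from dominating the ranking.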

