What is the word frequency distribution in the NY Times?
January 22, 2009 11:22 AM

How many different words (excluding proper nouns) appear in the New York Times on average?

I remember hearing this friendly fact at one point, but I can't find it anywhere on Google or MeFi. It was something along the lines of "300 words make up 80% of the New York Times." Does anyone have the actual frequency off-hand?
posted by stevekinney to Writing & Language (9 answers total)
This doesn't directly answer your question, but you might be interested to look at this word frequency list from the Brown Corpus. This is a collection of text from books, newspaper and magazine articles. You can see that short, common words dominate the top of the charts, as you might expect.

By far, words like the, of, by, that, is, for, etc. are the most popular. The frequency distribution of words is said to follow Zipf's law. If you're interested in word frequency, you might want to check out this searchable database of word frequency composed of data from Time magazine.
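For the curious, here's a minimal sketch of that tally in Python with NLTK, which ships the Brown Corpus (the download call is a one-time step):

```python
# Sketch: build a word-frequency list for the Brown Corpus with NLTK.
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download("brown")  # one-time fetch of the corpus data

# Count every alphabetic token, lowercased, so "The" and "the" merge.
freqs = FreqDist(w.lower() for w in brown.words() if w.isalpha())

# The short function words dominate the top of the list, as expected.
for word, count in freqs.most_common(10):
    print(word, count)
```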
posted by demiurge at 11:52 AM on January 22, 2009


Note that, following demiurge's link to the Brown Corpus, the first 2000 words make up only 75% of the words.

The first 300 words cover approximately 57%. Assuming the Brown Corpus is fairly representative (which seems reasonable), that quote overstates the coverage, but not by much.

Methodology for those who are interested: copy the word list into Excel and build a new column of cumulative frequency.
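The same calculation works without Excel; here's a minimal sketch in Python, assuming a (word, count) list sorted by descending frequency, e.g. freqs.most_common() from the NLTK sketch above:

```python
# Sketch: cumulative coverage of the top-n words, given a (word, count)
# list sorted by descending frequency.
def coverage(freq_list, n):
    """Fraction of all tokens accounted for by the n most frequent words."""
    total = sum(count for _, count in freq_list)
    return sum(count for _, count in freq_list[:n]) / total

# e.g. coverage(freqs.most_common(), 300) comes out around 0.57 for Brown,
# and coverage(freqs.most_common(), 2000) around 0.75, matching the
# cumulative-frequency column described above.
```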
posted by JMOZ at 12:20 PM on January 22, 2009


Not super scientific, but you can use Wordle to generate a "word cloud" online. Just check the "Do not remove common words" option. ... Ta-daa!
posted by misterbrandt at 12:28 PM on January 22, 2009


Word frequency in natural language tends to follow a Zipf distribution.

Plotting log(freq) vs. log(rank) gives you a straight line.

"There are many ways to state Zipf's Law but the simplest is procedural: Take all the words in a body of text, for example today's issue of the New York Times, and count the number of times each word appears. If the resulting histogram is sorted by rank, with the most frequently appearing word first, and so on ("a", "the", "for", "by", "and"...), then the shape of the curve is "Zipf curve" for that text. If the Zipf curve is plotted on a log-log scale, it appears as a straight line with a slope of -1." [source]
posted by qxntpqbbbqxl at 12:35 PM on January 22, 2009


Oops, meant to link to Zipf's law instead of Zipf himself. Stupid single quotes.
posted by qxntpqbbbqxl at 12:38 PM on January 22, 2009


Seems like I heard 161k words (not distinct words) for the NYT Sunday edition. Not sure where I heard it, though.
posted by charlesv at 1:02 PM on January 22, 2009


There's another corpus that is... maybe a subset of the Brown Corpus? It's just a huge set of tagged NYT articles, so you could actually figure this out. I remember using it for a project at some point, but I don't remember where I got it; probably from either Penn, Princeton, or Brown computational linguistics.
posted by jeb at 3:33 PM on January 22, 2009


jeb: U Penn is the home of all sorts of corpora, by way of the Linguistic Data Consortium, so that might be where it came from.

But anyway: The New York Times is one of the subcorpora of the Corpus of Contemporary American English, so if you search that for several words and restrict your hits to newspaper, you should be able to get enough data points to calculate the Zipf curve. You'll need to know the words' ranks, of course; for that I'd cheat over to British English and use Adam Kilgarriff's frequency-sorted unlemmatized list, since within the most frequent words there's not likely to be a significant difference except possibly for have and got.

At one point I had the rank-to-frequency conversion for English all calculated out (I wanted a collection of function words of a specific size, available in proportion to their real use, for something like magnetic poetry). I don't have it handy any more, but rest assured it can be calculated from corpus data with very little error, and once you've got that you can integrate over rank to see how much coverage you get for how many words. Have fun.
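For a back-of-the-envelope version of that, here's a sketch that assumes an idealized Zipf distribution (frequency proportional to 1/rank) over a hypothetical vocabulary size; the fitted rank-to-frequency conversion described above would replace the pure 1/rank weights:

```python
# Sketch: under an idealized Zipf law, the word at rank r gets weight 1/r,
# so the top n words out of a vocabulary of V cover H(n)/H(V) of the text,
# where H is the harmonic number.
def harmonic(n):
    return sum(1.0 / k for k in range(1, n + 1))

def zipf_coverage(n, vocab_size):
    """Fraction of tokens covered by the n most frequent of vocab_size words."""
    return harmonic(n) / harmonic(vocab_size)

# With a hypothetical 50,000-word vocabulary, zipf_coverage(300, 50000) is
# about 0.55 -- in the same ballpark as the Brown Corpus figure upthread.
```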
posted by eritain at 5:10 PM on January 22, 2009

