Software that will do a distinct word count?
November 17, 2004 3:15 AM

My friend from school claims that the average American only has about 10,000 words in his/her everyday vocabulary. As an example, he cites USAToday, saying if we were to count the number of distinct words used in that paper, there would only be a few thousand. I would like to check this out. Does anyone know of any software that will do a distinct word count?
posted by bluefly to Computers & Internet (10 answers total)
From New scientist ...
Steven Pinker in The Language Instinct compared the probable 60,000-word vocabulary of a typical US high-school graduate with the 15,000 words used in the complete works of Shakespeare, thus defining the "tetrabard" as a unit of vocabulary. We suspect that David Ridpath, who reminded us of this, may be a jaded teacher: "I can think of a few centibards I have known," he grumbles.
posted by seanyboy at 4:04 AM on November 17, 2004

Anyway... try google and Unique Word Count
and here and here and here
posted by seanyboy at 4:16 AM on November 17, 2004

A cool program I like is NoteTab Light. In addition to being a decent text editor, it can give Text Statistics. I can make a table of all the words and punctuation in a text file, and sort it by frequency. You can even right-click on the list and save it as a text file. (It's a tab-delimited text file, so it will open in Excel pretty easily too.)

Different words/items counted: 53
Total Words: 70
Total Punctuation: 8
Total Other Text: 0
Total Characters: 346
Total Paragraphs: 1
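
If you don't have NoteTab handy, a similar frequency table can probably be roughed out with standard Unix tools. This is only a sketch, not what NoteTab does internally: the function name word_freq is made up, and it assumes plain ASCII text, treating anything that isn't a letter as a separator (so "don't" comes out as "don" and "t").

```shell
#!/bin/sh
# Hypothetical helper (not part of NoteTab): print "count<TAB>word" for each
# distinct word read from stdin, most frequent first.
word_freq() {
    tr 'A-Z' 'a-z' |                 # fold case so "The" and "the" match
    tr -cs 'a-z' '\n' |              # turn each run of non-letters into one newline
    grep . |                         # drop the empty line left by a leading non-letter
    LC_ALL=C sort |                  # group identical words together
    uniq -c |                        # prefix each distinct word with its count
    LC_ALL=C sort -k1,1nr -k2,2 |    # highest count first, ties in alphabetical order
    awk '{ print $1 "\t" $2 }'       # replace the padding from uniq with a single tab
}

# Usage on a file would be: word_freq < whatever.txt > freq.tsv
printf 'The cat saw the other cat. The end.\n' | word_freq
```

The output is tab-delimited, so it should open in a spreadsheet the same way the NoteTab export does.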
posted by ALongDecember at 5:26 AM on November 17, 2004

It seems like it would be difficult to exclude proper names.
posted by smackfu at 6:56 AM on November 17, 2004

Not an application, but this Google Answer addresses the vocabulary size question.
posted by putzface_dickman at 7:24 AM on November 17, 2004

A little word-count thingy in Perl is easy to throw together...

...but it also depends on what you mean by "distinct words".

sing, sings, singing, sang, sung --five different words or all the same? If that's important to you, you'll want to ask whether the software or script you're using to count words takes plurals, etc. into account. It might need to have a built-in dictionary of some kind.

Or, you could get a raw count and then calculate backwards based on the probability that words will be plural, past tense, etc. to get a ballpark figure.
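
To make the sing/sings/singing point concrete, here is a rough sketch of a count that folds a couple of regular endings together. It is not a real stemmer and has no dictionary: the function name distinct_stems and the rule "strip -ing or -s when at least three letters remain" are assumptions made up for illustration, and the rule will mangle plenty of words ("this" becomes "thi").

```shell
#!/bin/sh
# Hypothetical helper: count distinct words on stdin after a crude
# suffix-stripping pass. This is a stand-in for real stemming, nothing more.
distinct_stems() {
    tr 'A-Z' 'a-z' |                           # fold case
    tr -cs 'a-z' '\n' |                        # one word per line
    sed -E 's/^([a-z]{3,})(ing|s)$/\1/' |      # drop -ing or -s, keeping a stem of 3+ letters
    grep . |                                   # discard empty lines
    LC_ALL=C sort -u |                         # keep one copy of each stem
    grep -c .                                  # count what is left
}

# Five surface forms go in; "sings" and "singing" collapse into "sing".
printf 'sing, sings, singing, sang, sung\n' | distinct_stems   # prints 3
```

The regular forms merge, but "sang" and "sung" are still counted separately, which is where a built-in dictionary of irregular forms would have to come in.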
posted by gimonca at 8:00 AM on November 17, 2004

Here's a quick-and-dirty command line version:
perl -pe '$/ = undef; s/<.+?>//g' whatever.html | \
sed -re 's/(\s+|--)/\n/g' | tr 'A-Z' 'a-z' | \
sed -e 's/^[^a-z]*//' | sed -e 's/[^a-z]*$//' | \
grep -v nbsp | grep -v href | sort | uniq | wc -l
Of course, it counts proper names, and counts different tenses/conjugations/etc. of the same word separately.

I tried running this on various texts, including a bunch of USA Today stories, A Tale of Two Cities, Hamlet, Neuromancer, and a collection of old newspaper editorials called 'Editorials from the Hearst Newspapers' from Project Gutenberg.

I certainly didn't find that USA Today's vocabulary was any less varied than any of the other texts. I measured the ratio of unique words to total words. Generally this ratio got lower as the length of the sample increased, until it levelled off at somewhere around 10%. So, to compare like with like, I reduced each text to a sample of around 10,000 words and then measured what percentage of them were unique. Results: USA Today 27.8%, Hamlet 22.4%, Neuromancer 24.1%, blog 23.5%, A Tale of Two Cities 20.4%, old newspaper editorials 23.2%.
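
For anyone who wants to repeat the fixed-size comparison, something along these lines should work. It is a sketch, not the exact commands used for the numbers above: the function name unique_ratio is made up, it assumes plain ASCII text with any HTML already stripped, and it splits on anything that isn't a letter.

```shell
#!/bin/sh
# Hypothetical helper: percentage of distinct words among the first N words
# read from stdin. N defaults to 10000, the sample size used in the post.
unique_ratio() {
    n=${1:-10000}
    tr 'A-Z' 'a-z' |          # fold case
    tr -cs 'a-z' '\n' |       # one word per line
    grep . |                  # discard empty lines
    head -n "$n" |            # keep only the first N words
    awk '!seen[$0]++ { uniq++ }
         END { if (NR) printf "%.1f%%\n", 100 * uniq / NR }'
}

# 8 words, 5 of them distinct
printf 'the cat and the dog and the bird\n' | unique_ratio   # prints 62.5%
```

Cutting every text to the same length before counting matters because the ratio keeps falling as a sample grows, so a short article would otherwise look more varied than a whole novel.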

Of course USA Today gets a bit of a boost because it uses a lot of proper names. If you filtered them out I'd bet it would fall more in line with the other texts.

As to whether there are only a few thousand unique words in all of a given issue of USA Today, I doubt that, since we're already up to almost 3,000 in only a 10,000 word sample. However, I'm not sure that's the right question to be asking, since, e.g., Hamlet only contains about 5,000 unique words (out of 32,000), and A Tale of Two Cities only about 10,000 (out of 136,000).

I don't buy that teenagers' vocabularies today are any worse than they were 50 years ago (from the Google Answers link). Everyone seems to be parroting that statistic without any understanding of how the study was performed. I tried to find where they're getting that number. The online sources all seem to point to a book called _The Resurgence of the Real_. I looked up the reference in that book on Amazon, and its endnotes refer back to a Harper's Index from August 1990. Come on, you're going to use a Harper's Index as a reference without consulting the original study?

I bet if you could somehow find a recording of that crotchety professor when he was 17, he wouldn't sound particularly erudite. Teenagers speak in a way that will help them fit in with other teenagers; showing off a large vocabulary (even if you have one) is detrimental to that goal.
posted by mcguirk at 9:45 AM on November 17, 2004

There is not supposed to be a space before .+? in the first line, by the way.
posted by mcguirk at 9:47 AM on November 17, 2004

As an example, he cites USAToday, saying if we were to count the number of distinct words used in that paper,
I thought newspapers were written in grade-school English for easy reading. If that's true, your example may be off, since a lot of Americans are educated well past grade school.
posted by thomcatspike at 9:49 AM on November 17, 2004

There have been many studies done of human vocabulary. I don't think that USA Today is particularly representative of what a single person's vocabulary range might be, though.

Certainly most newspapers are written to a broad audience, and while USA Today is no slouch (winning Pulitzers, excellent 9/11 coverage), it is definitely written in a breezier, quick-read, and probably lower grade-level English than even most newspapers. (Its prototypical reader is a business traveler in an airport or over a bagel in a hotel.) I know I've seen a ranking of newspapers by reading grade level, but I can't find one right now (the WSJ is highest, NYT right behind, but most are at a high-school level around 10th grade). Ah -- as it happens, the NY Observer just published a piece comparing the NYC papers' reading levels on a common metric, a local crime story. The measurement was done using the Flesch-Kincaid Index, a formula for estimating the complexity of prose. This is available in MS Word, by the way. Here's a broader overview of reading levels and various popular publications measured against them. See also Flesch-Kincaid analysis of Inaugural Speeches.
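
For reference, the grade-level version of the formula is simple enough to compute by hand once you have the counts. The sketch below assumes the Flesch-Kincaid Grade Level variant, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, which may or may not be the exact variant the Observer piece used. The function name fk_grade is made up, and the three counts are passed in by the caller because counting syllables automatically needs a heuristic or a dictionary.

```shell
#!/bin/sh
# Hypothetical helper: Flesch-Kincaid Grade Level from three counts.
# Arguments: total words, total sentences, total syllables.
fk_grade() {
    awk -v w="$1" -v s="$2" -v y="$3" 'BEGIN {
        # 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
        printf "%.2f\n", 0.39 * (w / s) + 11.8 * (y / w) - 15.59
    }'
}

# 100 words in 5 sentences with 150 syllables: about a 10th-grade level
fk_grade 100 5 150   # prints 9.91
```

Note that the formula only looks at sentence length and syllables per word, so it says nothing about how many distinct words a text uses.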

In any case, I think this simply demonstrates that your sample is flawed if you're trying to estimate the average adult's vocabulary. But if you're inclined to experiment (and I too thought these formulas were neat when I finally had computers that could do it all for me -- but that was 1985), there's plenty of fodder.
posted by dhartung at 11:40 PM on November 17, 2004
