How many unique words?
December 3, 2009 11:25 AM

How can I determine how many unique words and how many repetitions are in a text?

I have a small translation contract and the client has different rates for repetitions and for unique words. How can I find this out? They are Word documents, I'm on a Mac using Word 2004. Is there a program I can download, or a web-based solution?
posted by OLechat to Computers & Internet (21 answers total) 2 users marked this as a favorite
 
The Atlantis Word Processor has this as a core feature (Tools-->Overused Words). Despite having the misleading "Overused Words" moniker, the feature actually offers a very robust statistical breakdown of all the unique words and repetitions (of both individual words and word pairs).

I don't think Word does anything similar. You can download a free trial of Atlantis though, and it opens Word documents. You'll have to use a PC, however.
posted by 256 at 11:34 AM on December 3, 2009


You could use concordance building software like antconc or concorder.
posted by gyusan at 11:44 AM on December 3, 2009


You can sort of do this in Excel, though it might be a many-stepped process...

- Do find-and-replaces to get rid of any unwanted punctuation (e.g. periods... you can probably go either way on apostrophes and the like, since they're part of words) and returns -- your goal here is to end up with one big block of text with nothing but spaces between the words.
- Then do a find and replace of spaces to returns, so each word is on its own line.
- Copy this into Excel, so that each word is a cell in column A
- Sort this list alphabetically
- Then put the following formula into cell B1:
=IF(COUNTIF($A$2:A2,A2)=1,1,"")
- Fill down column B. This will put a "1" next to the first occurrence of each word and nothing next to repeated occurrences (you can change these by changing what comes after the first and second commas in the formula). This step tends to take Excel a while. I like to copy this column and then paste it as values before doing any further manipulation.
- You can then sum this column (e.g. with =SUM(B:B)) -- voila, that is how many unique words you have. The remainder (easily found by seeing what row your last word is in, then subtracting your unique-word count) are the repetitions.

I'm sure there are programs that will do this easier/faster, but if you can't get access to those programs, this ought to work.
posted by brainmouse at 11:46 AM on December 3, 2009


It's already there in the form of Unix shell tools. Save your doc as plain text. Open Terminal.

Assume your file name is "YOURDOC".
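
(Incidentally, if you'd rather not open the file in Word just to re-save it as text, the textutil utility that ships with OS X should be able to do the conversion for you; a rough sketch, assuming the Word file is named YOURDOC.doc:)

# convert the Word document to plain text; this should write a YOURDOC.txt next to the original
textutil -convert txt YOURDOC.doc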

To get word count:

wc -l YOURDOC

To get unique words (put each word on a line, splitting on spaces and punctuation except apostrophes; sort all lines; discard empty lines; collapse runs of the same word; count lines):

tr '!?.,;:"() ' '\n' < YOURDOC |grep -v '^$' |sort |uniq -c |wc -l

To get each word's number of repetitions, same except don't count lines (redirect the result to a file instead):

tr '!?.,;:"() ' '\n' < YOURDOC |grep -v '^$' |sort |uniq -c > SAVED_RESULT_FILE

...maybe also sort by number of occurrences.

tr '!?.,;:"() ' '\n' < YOURDOC |grep -v '^$' |sort |uniq -c |sort -n > SAVED_RESULT_FILE
posted by cmiller at 11:48 AM on December 3, 2009


Aaah I messed up the formula. Assuming your first word is in cell A1, the formula should actually be:

=IF(COUNTIF($A$1:A1,A1)=1,1,"")

(also, if it was unclear: if you do a find-and-replace where you put a period in the find box and nothing in the replace box, it will just delete all the periods).
posted by brainmouse at 11:48 AM on December 3, 2009


Damn you, Metafilter markup breaker!
posted by cmiller at 11:48 AM on December 3, 2009


Brute force:

Most text editors will do word counts; that will give you the total word count.

To get unique word count:
You can also do a search-and-replace of the space character with a carriage return and then sort the resulting list. Most text editors have an option when sorting to "remove duplicates".
The length of the resulting sorted list is the # of unique words you have.

Repetitions:
It's not clear if you get paid for any repetition at a consistent rate or if there is a sliding scale.
If it's fixed, the # of repeated words should be the number of total words minus the number of unique words.
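
For what it's worth, that subtraction can also be done from Terminal with the standard tools mentioned elsewhere in this thread. A rough sketch, assuming the document has been saved as plain text under a hypothetical name like YOURDOC.txt:

# total word count
wc -w < YOURDOC.txt

# number of distinct words: one word per line, punctuation stripped, lowercased, deduplicated, counted
tr -cs 'A-Za-z' '\n' < YOURDOC.txt | tr 'A-Z' 'a-z' | grep -v '^$' | sort -u | wc -l

# repetitions = total words minus distinct words (subtract the two numbers above)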
posted by bottlebrushtree at 11:49 AM on December 3, 2009


Here's a one-liner that'll do it. :)
perl -nle '$w{$_}++ for grep /\w/, map { s/[\. ,]*$//g; lc($_) } split; sub END { printf("%7d\t%s\n", $c, $w) while (($w,$c) = each(%w)) }' filename.txt | sort

posted by TheNewWazoo at 11:51 AM on December 3, 2009


I should add, replacing that final "sort" with "wc -l" will give you a count of unique words.
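
That is, something along these lines should print just the number of distinct words (the same one-liner, with the trailing sort swapped out):

perl -nle '$w{$_}++ for grep /\w/, map { s/[\. ,]*$//g; lc($_) } split; sub END { printf("%7d\t%s\n", $c, $w) while (($w,$c) = each(%w)) }' filename.txt | wc -l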
posted by TheNewWazoo at 11:52 AM on December 3, 2009


wc -l YOURDOC

tr '!?.,;:"() ' '\n' < YOURDOC |grep -v '^$' |sort |uniq -c |wc -l

tr '!?.,;:"() ' '\n' < YOURDOC |grep -v '^$' |sort |uniq -c > SAVED_RESULT_FILE

tr '!?.,;:"() ' '\n' < YOURDOC |grep -v '^$' |sort |uniq -c |sort -n > SAVED_RESULT_FILE
posted by cmiller at 11:53 AM on December 3, 2009


Old school unix command line style:

save the document as plain text.

cat file | tr ' ' '\n' | sed -e 's/[^a-zA-Z]//g' | tr A-Z a-z | sort | uniq -c

will give you an alphabetical list of all unique strings of characters in the file, with a numeric usage count for each. This of course will not account for line break hyphenations or conjugated words, but it is quick and easy and time-tested and ships with every Mac.
posted by idiopath at 11:58 AM on December 3, 2009


Since we are all thinking out loud with our command line solutions here, a small refinement of the above:

cat file | tr ' ' '\n' | sed -e 's/[^a-zA-Z]//g' | tr A-Z a-z | sort | uniq -c | sort -n

The only change is the final sort by usage count.

cmiller: your version calls different capitalizations the same word, but I like your final sort by usage count so I am stealing that :)
posted by idiopath at 12:03 PM on December 3, 2009


err of course I meant calls different capitalizations different words.
posted by idiopath at 12:04 PM on December 3, 2009


If the command line stuff is scaring you, the king of all Mac text processors, BBEdit, has unique-word-count and remove-duplicates and a million other handy utility functions as menu commands that you can use on any text file.
posted by rokusan at 12:09 PM on December 3, 2009


Are you a professional translator? Are you planning on making a trade of this business? If so, you're going to want to get your hands on one of the available translation memory tools, either commercial (SDL Trados is the most popular) or open source (OmegaT is the go-to option there). There are a ton of others - search for "Translation Memory". None of them are great, but they are absolutely necessary for a career in translation.

This word-price disparity exists because all of those tools give you reports on new words, fuzzy matches (against an existing translation memory), and repetitions. You are being paid less because these tools will auto-suggest or auto-populate exact matches across a file. Since you do not have these tools, you should not accept their price differential, as the work required for you to do this job will be the same whether a word has been previously translated or not.
posted by Wolfie at 12:13 PM on December 3, 2009 [3 favorites]


Wolfie is spot on. And even if you have a tool like this, there is an argument about whether you should ever accept much of a sliding scale for whatever percentage of repetitions anyway, because you are responsible for quality-checking, reading and understanding the entire document, and you have paid for the tool yourself to look after a small part of that process. Wanting you to accept a sliding pay scale is the sign of a crappy client, even if it is common practice.
posted by runincircles at 12:38 PM on December 3, 2009 [1 favorite]


Response by poster: Wow, I wasn't expecting so many excellent replies! The client ended up sending me the word count shortly after I put this up, but I'll try out the command line suggestions. Thanks to everyone for their help.

Wolfie and runincircles: I'm not a professional; this is an occasional thing. I thought that charging two different rates was common practice - you've definitely given me something to think about.
posted by OLechat at 1:46 PM on December 3, 2009


If it's not something that needs a ton of work, and not ultra-secret, you can send it to me, and I'll send you the results. I've written a ton of text-parsing tools, and have something that can do this. MeMail me, if so.

...Just a thought. Go nuts with the Perl, etc, listed above, if you're familiar with all that.
posted by iftheaccidentwill at 2:32 PM on December 3, 2009


To get word count:

wc -l YOURDOC


No no no! It's

wc -w YOURDOC

Other instances of "wc -l" seem to be okay on quick scan.
posted by tss at 2:50 PM on December 3, 2009


Coming in late but I hope this will be helpful - what you're looking for is not repetitions of *words* but repetitions of *translation units or segments* (i.e. sentences), and any translation tool as described by Wolfie will do such an analysis.

Your client is undoubtedly talking about repeated sentences, which a translation tool can be very helpful with, as it will suggest a matching sentence if you've translated it before. A word-level analysis is not useful, as the same word can (and often should) be translated differently according to context, and much (good) translation is not done word-for-word in any case.

In any case, an analysis is useless if you don't have the tools that will help you exploit repeated sentences. I'd tell the client you don't use CAT tools, unless you're prepared to download a trial version or two and try them out. If you want to do a significant amount of translation, the tools are invaluable and I strongly suggest you read up on them.
posted by altolinguistic at 3:05 AM on December 4, 2009


In other words, the first dozen or so comments here, while pretty, are useless for your purposes.

In terms of pricing - CAT tools can speed your work up considerably, but IMO there's too much pressure from the client side to pass this on in terms of discounts. A discount can in some instances be warranted, but for a translator operating on a freelance basis this should be a decision freely and intelligently made.
posted by altolinguistic at 3:13 AM on December 4, 2009

