Unique word list application needed
June 4, 2007 7:37 PM

I'm looking for a method or application that will create a list of unique words in a large text document (a Word document, currently, but can be changed to accommodate another format).

I want to see just a list of the words in this document, but don't need to know their frequency or order. So the list would include every word used in the document, but show each word only once.

Possible? I am on a Mac, but can work on a PC if needed. Would also be fine to use a terminal app. Need a free solution, so commercial apps will not fit this particular bill.
posted by qwip to Computers & Internet (20 answers total) 1 user marked this as a favorite
 
Quick and dirty, and assumes you have Word and Excel:

Paste the text into Word. Select 'Replace'. Search for a space and replace it with a line break. Copy the resulting list of words and paste it into Excel. Select Data->Filter->Advanced Filter. Click on 'Filter the list in place' and then on 'Unique records only'.
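(In Word's Replace dialog, you can type ^p, Word's code for a paragraph mark, into the "Replace with" box to get the line break.)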

This will give you a list of the words used without dupes.
posted by jikel_morten at 7:43 PM on June 4, 2007


Oops - sorry. I ignored/missed the commercial app bit.
posted by jikel_morten at 7:54 PM on June 4, 2007


This VBA macro claims to do what you need. Not sure if your version of Mac Word can handle it.
posted by djb at 8:01 PM on June 4, 2007


Take jikel_morten's solution, but paste into a text file instead of Excel, then run the Unix command uniq on that file from the Terminal.
posted by phoenixy at 8:02 PM on June 4, 2007


fmt -1 filename.txt | sort | uniq
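For example:

echo "the cat and the hat" | fmt -1 | sort | uniq

prints each distinct word once, one per line:

and
cat
hat
the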
posted by putril at 8:02 PM on June 4, 2007


Whoops, sorry, you actually need to do this:

sort filename.txt | uniq

I forgot that uniq only compares adjacent lines, so the input has to be sorted first. That straight line in the middle is the pipe key.
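A quick way to see why:

printf "b\na\nb\n" | uniq

still prints b, a, b, since the two b's aren't adjacent, but

printf "b\na\nb\n" | sort | uniq

prints just a and b.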

posted by phoenixy at 8:03 PM on June 4, 2007


I like putril's suggestion, but it doesn't deal well with punctuation (a word followed by a comma is treated as different from the same word without the comma).

How about this perl?
perl -e 'while (<>) { s/([\w\x27]+)/$w{lc($1)}=1/ge; print join("\n", sort keys %w);' filename.txt

\x27 = apostrophe. Without that I was getting "don" and "t" as two separate words.
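(The trick is using a hash as a set: the /e flag runs the replacement as code, so $w{lc($1)}=1 records each lowercased word as a hash key. If that reads as line noise, here's an untested but hopefully clearer way to write the same thing:

perl -ne 'for (/[\w\x27]+/g) { $seen{lc $_} = 1 } END { print "$_\n" for sort keys %seen }' filename.txt
)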
posted by aneel at 8:29 PM on June 4, 2007


Oh, one other gotcha is hyphens, so make that square-bracketed bit: [\w\x27-]

It still won't deal properly with hyphenated words split across lines, though.


Or to extend putril's suggestion:
fmt -1 filename.txt | sed -e "s/[^A-Za-z'-]//g" | sort | uniq

The perl version lowercases the words, the pipe version doesn't.
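For example:

echo "Don't stop, don't look" | fmt -1 | sed -e "s/[^A-Za-z'-]//g" | sort | uniq

strips the comma, but "Don't" and "don't" still show up as two separate entries.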
posted by aneel at 8:43 PM on June 4, 2007


There's a free program called TextSTAT that does what you're asking for. It can read MS Word files directly. It does print out the frequency for each word, but you can export the results to a text or CSV file and delete that column. The program is available for Windows, Linux, and Mac OS X.
posted by Jasper Friendly Bear at 8:48 PM on June 4, 2007 [1 favorite]


Response by poster: Hey, just want to thank all for quick and comprehensive answers!

The terminal solutions were less usable for me than the Word/Excel approach proposed by jikel_morten, so thanks for that. It was quick and dirty, if a tad blunt.

The VBA macro suggested by djb is quick, but adds the word count at the end. Combined with jikel's solution (and a little clean-up) it works like a champ.

I wasn't able to get the perl or python (TextSTAT) solutions working, as frankly I have no idea how to make them work from the command line or otherwise. So, they may be perfect, but I couldn't make it happen.

Thanks, everyone!
posted by qwip at 9:27 PM on June 4, 2007


Glad you got something that worked.

For future reference: to use a command line suggestion like the ones above, first save your document as a plain text file, preferably with a name not including spaces. Open a Terminal window. Type the command as shown, but replace "filename.txt" with the location of the text file you saved.

What's the "location" of the file? When you open a Terminal, it will default to your Home directory. If the text file is there, just type its name. If the text file is somewhere else, Mac OS X has a neat trick: if you drag and drop the icon for your file on the Terminal window, the location of the file will be typed for you.

When you hit return, the results of the command will be printed in the Terminal window. You can use copy and paste to grab them, or if the output is long, you can add this to the end of one of those recipes:
> count.txt
That sends the results to a file called count.txt (again, in your Home directory by default). Be a little careful: it will happily replace any file that already exists with that name.
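(If you'd rather add to an existing file than replace it, use two angle brackets: >> count.txt appends instead of overwriting.)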

Here's a full example:
fmt -1 /Users/aneel/Documents/mybook.txt | sed -e "s/[^A-Za-z'-]//g" | sort | uniq > count.txt
posted by aneel at 9:53 PM on June 4, 2007


Scrivener has a word frequency measuring option. It's not free, but you could use the free trial if this is a one-off.
posted by dhruva at 10:38 PM on June 4, 2007


Response by poster: Hey, aneel, that worked a treat. I especially liked the pipe to the text file (although it took me a moment to figure out that it went to my ~/ folder; thank you, Spotlight!). Cheers!

For the group, just out of curiosity, how would one (who's not terribly bright) run one of the above perl or python scripts against said text file?

Oh, and for anyone who is curious, the file had 2,284 unique words and the highest frequency for a word over 5 letters was 214.
posted by qwip at 1:52 AM on June 5, 2007


1. Convert file to plain text and put in your home directory.

2. Open terminal:

perl -ne '@foo = split " "; print map {"$_\n"} @foo;' myfile.txt | sort | uniq > mywordlist.txt

(untested, you can replace any other of the command line suggestions here).

Perl has lots of modules to support this kind of stuff. You could use Lingua::EN::StopWords to filter the stopwords out of the word list.
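Untested, and assuming the %StopWords hash that module's docs describe, that would look something like:

perl -MLingua::EN::StopWords=%StopWords -ne 'print map {"$_\n"} grep { !$StopWords{lc($_)} } split " ";' myfile.txt | sort | uniq > mywordlist.txt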

Also, these variants on the end:

sort | uniq -c | sort -n # words sorted by frequency of use, with counts
sort | uniq | wc -l # number of unique words
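uniq -c puts the count in front of each line, so the frequency variant's output looks something like:

   1 aardvark
   3 apple
  12 the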
posted by singingfish at 5:40 AM on June 5, 2007


Running a perl command on a file is just like running any other command. Just replace the "filename.txt" in the example with the location of the text file.

So:
perl -e 'while (<>) { s/([\w\x27]+)/$w{lc($1)}=1/ge; print join("\n", sort keys %w);' /Users/aneel/Documents/mybook.txt

or:
perl -e 'while (<>) { s/([\w\x27]+)/$w{lc($1)}=1/ge; print join("\n", sort keys %w);' /Users/aneel/Documents/mybook.txt > count.txt
posted by aneel at 11:22 PM on June 5, 2007


Response by poster: Hmm. Whenever I run the perl commands I get this error:

Missing right curly or square bracket at -e line 1, at end of line
syntax error at -e line 1, at EOF
Execution of -e aborted due to compilation errors.

posted by qwip at 11:30 PM on June 5, 2007


Oops. That's because there's a missing right curly bracket. Sorry about that.

perl -e 'while (<>) { s/([\w\x27]+)/$w{lc($1)}=1/ge; } print join("\n", sort keys %w) . "\n";' filename.txt

This version also prints a newline at the end of the output; the lack of one was why copying and pasting the previous version's output lost a few characters.
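A quick sanity check, feeding it a line on stdin instead of a file:

echo "The cat and the hat" | perl -e 'while (<>) { s/([\w\x27]+)/$w{lc($1)}=1/ge; } print join("\n", sort keys %w) . "\n";'

should print and, cat, hat, the, one per line.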

Incidentally, if you want a list of the words over 5 letters, along with their frequencies...

perl -e 'while (<>) { s/([\w\x27]+)/$w{lc($1)}++/ge; } print join("\n", sort {$a <> $b} map {"$w{$_}\t$_"} grep {length($_)>5} keys %w) . "\n";' filename.txt

or:
fmt -1 filename.txt | sed -e "s/[^A-Za-z'-]//g" | sed "y/[A-Z]/[a-z]/" | grep "......" | sort | uniq -c | sort

The latter seems a little clearer, since it's just a sequence of steps in order: put each word on a line, remove anything that's not a letter or apostrophe or hyphen, change uppercase letters to lowercase, select only the lines that contain at least six characters, sort that list, count the consecutive words that are the same, sort that list.
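The end result is roughly sorted by count, least frequent first, one line per word, something like (numbers made up):

   2 barely
   5 people
 214 something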
posted by aneel at 12:05 AM on June 7, 2007


Hmm. The editor messed with the perl version in a subtle way. It should be sort {$a <=> $b}.
posted by aneel at 12:08 AM on June 7, 2007


Argh. sed "y/[A-Z]/[a-z]/" doesn't actually lowercase the text. It converts "A" to "a" and "Z" to "z", but nothing in between.

How about:
perl -e 'while (<>) { s/([A-Za-z\x27-]+)/$w{lc($1)}++/ge; } print join("\n", sort map {sprintf("%7d $_", $w{$_})} grep {/....../} keys %w) . "\n";' filename.txt > perl.out

and:
sed -e "s/[^A-Za-z'-]/\n/g" filename.txt | fmt -1 | sed "y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/" | grep "......" | sort | uniq -c | sort > pipe.out

I hope I got it right this time...
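(In retrospect, tr "[:upper:]" "[:lower:]" would be a simpler way to do the lowercasing step.)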
posted by aneel at 12:26 AM on June 7, 2007


Response by poster: Bon! That is sweet, aneel, and thanks for the explanation and the code fix for the perl. Very cool and helpful.
posted by qwip at 2:47 AM on June 8, 2007


This thread is closed to new comments.