Finding number of unique words in a document using Python
June 6, 2011 11:53 AM   Subscribe

Newbie Python advice required: counting unique keywords in a document.

I'm almost completely new to Python, and have been trying to write a programme to show the count of each unique word in a document. So what I want at the end is an output that tells me there are 10 uses of 'and', 5 uses of 'it', 23 uses of 'of' and so on.

The way I've approached it to this point is:
- Read the text tile using open and read
- Split the text using text.split()
- Convert everything to lowercase using text.lower()
- Create a set of the individual words which automatically filters so that it only contains unique words

The length of the set gives me the total number of uniques, but that's not what I'm after. If I could iterate through the records in the set, I could count how many times each unique word appears in the text when it has been split into words from the original, but if it is possibly to iterate through a set in that way in Python, I have not been able to figure out how to do it. (I feel that part of my difficulty is that I am 'cheating' by using the set functionality to filter for uniques, and I should be doing this for myself in some way, presumably with regular expressions(?))

From some research, I believe that another way to approach this problem would be using the dictionary in Python and a hash table, though I don't fully understand what that means.

My question is then: is there a way to get to what I want using what I have done so far, or do I need to start again by learning about some other technique (e.g. hash tables)?

Thanks as ever, MeFi.
posted by StephenF to Computers & Internet (16 answers total) 4 users marked this as a favorite
 
You want to use the hash table (does Python call them dictionaries?). Hash tables are arrays that are keyed on strings, rather than regular arrays which are keyed on numbers.
posted by scruss at 11:59 AM on June 6, 2011


yes - you want a hash table (or dictionary) for this. This is actually an assignment that I give to beginning students, btw.
A dictionary maps keys (words, in this case) to values (the number of times the word occurs).

What you want to do is:
For each line in the file :
split into words and lowercase it as you've done. You might also want to strip off punctuation. (read about strip() to see how to do this.)
for each word:
see if it's in your dictionary. If so, increment the count. If not, dic[word] = 1.
posted by chbrooks at 12:00 PM on June 6, 2011 [1 favorite]


Python dictionaries are hash tables. The text is your key, the number of instances is the value.
posted by djb at 12:01 PM on June 6, 2011


If you are using Python 2.7 or later, I would do it this way:
import collections

infile = open("myfile.txt")
words = collections.Counter()

for line in infile:
    words.update(line.split())
                              
for word, count in words.iteritems():
    print word, count
The collections.Counter class is a dict subclass that is designed exactly for this use case, so there is no need to check for initialization to zero, or so on. Dealing with punctuation as chbrooks suggests is left as an exercise to the reader.
posted by grouse at 12:13 PM on June 6, 2011 [8 favorites]


If this isn't just a way of learning Python, you might want to look into Python's Natural Language Processing Toolkit (and free book). It probably includes a utility to do this for you. Even if it doesn't, it provides functions to get stems for all words in a document (or corpus) to give you a different picture of usage.
posted by yerfatma at 12:34 PM on June 6, 2011


The one addition I'd make to grouse's code is that you probably want that list sorted, say, high to low:

sortedwords = sorted([(w, words[w]) for w in words.keys()], key=lambda i: i[1], reverse=True)

Which creates an array of tuples, (word, word count), and then sorts by the second element of that tuple.
posted by straw at 1:08 PM on June 6, 2011


True, although I would do it like this:
import operator

sorted_words = sorted(words.iteritems(), key=operator.itemgetter(1), reverse=True)
for word, count in sorted_words():
    print word, count
You don't need to create the list of tuples yourself, because dict.iteritems() does that for you. And operator.itemgetter() is, again, designed exactly for use as a key function to sorted().
posted by grouse at 1:21 PM on June 6, 2011 [3 favorites]


Thanks, grouse, that's the difference between someone who actually knows Python and someone who read a book on it once...
posted by straw at 1:22 PM on June 6, 2011


If you're not using Python 2.7, you can get a pretty similar effect with a defaultdict, available since Python 2.5:
from collections import defaultdict
from operator import itemgetter

infile = open("test.txt")
words = defaultdict(int)

for line in infile:
    for word in line.split():
        words[word] += 1

sorted_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)
for word, count in sorted_words:
    print word, count

posted by SemiSophos at 1:42 PM on June 6, 2011 [2 favorites]


FYI a great place to ask questions like this is stackoverflow.
posted by katrielalex at 1:55 PM on June 6, 2011


Response by poster: Many thanks guys. I will need to work through the information you've given me at a point where I have not just had three Jack Daniels, i.e. not now. My aim with the programme is both to learn Python and to have something I can actually use when editing long documents.

If you guys feel like answering a bonus question before I get back to this tomorrow evening: how would you expand beyond single words and up to phrases of say maximum five words?
posted by StephenF at 3:59 PM on June 6, 2011


Assuming you are only interested in ordered phrases (rather than an unordered bag of five words), here is one way:
import collections
import operator

PHRASE_SIZE = 5

infile = open("myfile.txt")
counts = collections.Counter()

for line in infile:
    words = line.split()
    counts.update(zip(*(words[offset:] for offset in xrange(PHRASE_SIZE))))

sorted_counts = sorted(counts.iteritems(), key=itemgetter(1), reverse=True)
for phrase, count in sorted_counts.iteritems():
    print "\t".join([" ".join(phrase), str(count)])
This prints the phrases and counts in a tab-delimited style instead of separated by spaces. You should be able to look up additional functions like zip, xrange, str.join. The use of * in an argument means that you are drawing the rest of the arguments from a sequence object. zip combined with a *-argument can be incredibly useful.
posted by grouse at 4:21 PM on June 6, 2011


itemgetter needs to be imported from operator, obviously.
posted by grouse at 4:23 PM on June 6, 2011


This is a great question, and I also use it to teach my students file IO, sorting, and dictionaries, which are Python's hash tables. Whenever I learn a new language, this program is the first thing I write.

grouse covered a short, idiomatic way of solving this problem. It's very good. I tend to show a different way, which uses neither collections nor operator. This code is not as good as grouse's, trading brevity for being a little easier to understand at first. In the real world, you'd be much more likely to see grouse's solution. That said:
word_to_count = {}

for line in open('myfile.txt'):
  words = line.split()
  for word in words:
    word_to_count[word] = word_to_count.get(word, 0) + 1

def GetCount(pair):
  return pair[1]

sorted_words = sorted(word_to_count.iteritems(), key=GetCount, reverse=True)
Some notes:

The first hard part of this program is that we build word_to_count manually, by using the dictionary's built-in get method. The meaning of this line in English is, "Get the current count for a given word, defaulting to 0 if we've never seen it before. Add one, and write the new value over the old value".

We need to sort word_to_count after we build it. Why? Because dictionaries are unordered. This is a side-effect of how they are implemented, and if you just print out their contents they'll come out sorted in a way that's efficient for the machine but nonsensical to humans. sorted is the Python built-in for taking any iterable and returning a new, sorted list from that iterable.

word_to_count.iteritems returns an iterator over the dictionary's (word, count) pairs. We want to sort by the count, so we need some way to extract it from the pair. That's what GetCount does: it takes a (word, count) pair and returns its second element. This is also exactly what operator.itemgetter(1) and lambda i: i[1] do.

If you want a more thorough explanation of this problem, it's covered in the Dicts and Files lecture (video, notes) from Google's Python class.

The code above is case-sensitive and doesn't strip punctuation, meaning that it considers 'A' and 'a' to be different words, and would happily insert the last word of a sentence along with its trailing period into word_to_count. To fix this, you'd want to lowercase each word and strip off punctuation before the word_to_count[word] = ... line. I would make a helper method to do this. Again, optimizing for clarity, you might add the following to the top of your program:
import re
STRIP_PUNCTUATION_REGEX = re.compile(r'[".,;:-]')

def GetTokenFromWord(word):
  return STRIP_PUNCTUATION_REGEX.sub('', word.lower())
Regular expressions are a complicated subject. Briefly, re.compile builds a pattern. Calling a pattern's sub method lets you substitute every match of that pattern with a given replacement, which because we want to strip characters is the empty string. The contents of the pattern are the characters we want to consider matches. In this example, that's double quote, period, comma, semicolon, colon, and dash. The uppercase variable name is the conventional Python way of indicating a module-level value that shouldn't be changed after it's first set.
posted by amery at 12:15 AM on June 7, 2011


If you need real time help, and aren't averse to IRC, come and join us: irc.freenode.net at ##python-friendly , and we love helping out with this sort of thing!
posted by gregglind at 5:31 AM on June 7, 2011


up to phrases of say maximum five words

To beat a dead horse (since I'm drinking now), NLTK will examine bigrams, trigrams, etc. You'd be best off with that. Otherwise you'll run into some fun dark corners which will teach you more about programming. So, your choice.
posted by yerfatma at 6:40 PM on June 8, 2011


« Older criminals, victims, and criminal victims   |   Check yourself before you wreck yourself Newer »
This thread is closed to new comments.