import collections
infile = open("myfile.txt")
words = collections.Counter()
for line in infile:
    words.update(line.split())
for word, count in words.iteritems():
    print word, count
The collections.Counter class is a dict subclass designed for exactly this use case, so there is no need to initialize counts to zero, and so on. Dealing with punctuation as chbrooks suggests is left as an exercise for the reader.

Instead of

sortedwords = sorted([(w, words[w]) for w in words.keys()], key=lambda i: i[1], reverse=True)

you can write:
import operator
sorted_words = sorted(words.iteritems(), key=operator.itemgetter(1), reverse=True)
for word, count in sorted_words:
    print word, count
You don't need to create the list of tuples yourself, because dict.iteritems() does that for you. And operator.itemgetter() is, again, designed exactly for use as a key function to sorted().
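To make that concrete, here is a minimal sketch showing that operator.itemgetter(1) and a hand-written lambda give the same ordering. The sample dictionary is made up for illustration, and .items() stands in for .iteritems() so that the snippet should run on both Python 2.7 and Python 3.

```python
import operator

# Hypothetical sample counts, just for illustration.
words = {"the": 3, "cat": 1, "sat": 2}

# .items() is used instead of .iteritems() so the sketch also runs on Python 3.
by_itemgetter = sorted(words.items(), key=operator.itemgetter(1), reverse=True)
by_lambda = sorted(words.items(), key=lambda pair: pair[1], reverse=True)

# Both key functions pull the count out of each (word, count) pair,
# so the two sorted lists are identical.
print(by_itemgetter == by_lambda)
print(by_itemgetter)
```

The counts in the sample are all different on purpose, so the order does not depend on how ties are broken.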
from collections import defaultdict
from operator import itemgetter
infile = open("test.txt")
words = defaultdict(int)
for line in infile:
    for word in line.split():
        words[word] += 1
sorted_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)
for word, count in sorted_words:
    print word, count
import collections
import operator
PHRASE_SIZE = 5
infile = open("myfile.txt")
counts = collections.Counter()
for line in infile:
    words = line.split()
    # Zip PHRASE_SIZE shifted copies of the line to get every run of consecutive words.
    counts.update(zip(*(words[offset:] for offset in xrange(PHRASE_SIZE))))
# sorted() returns a list of (phrase, count) pairs, so iterate over it directly.
sorted_counts = sorted(counts.iteritems(), key=operator.itemgetter(1), reverse=True)
for phrase, count in sorted_counts:
    print "\t".join([" ".join(phrase), str(count)])
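The zip(*...) line is the dense part of that script, so here is a minimal sketch of what it computes for a single line of text. The sample sentence and the smaller PHRASE_SIZE are assumptions chosen to keep the output short, and range stands in for xrange so that the snippet should run on both Python 2.7 and Python 3.

```python
PHRASE_SIZE = 3  # smaller than the script's 5 so the output stays short

# Hypothetical sample line standing in for one line of the input file.
words = "the quick brown fox jumps".split()

# One shifted copy of the word list per offset: words[0:], words[1:], words[2:].
shifted = [words[offset:] for offset in range(PHRASE_SIZE)]

# zip(*shifted) is the same call as zip(shifted[0], shifted[1], shifted[2]).
# zip stops at the shortest input, so only complete phrases are produced.
phrases = list(zip(*shifted))

for phrase in phrases:
    print(" ".join(phrase))
```

Because zip stops at the shortest input, a line with fewer than PHRASE_SIZE words contributes no phrases at all. The script also handles each line separately, so phrases never span a line break.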
The phrase-counting script above prints each phrase and its count separated by a tab instead of a space. You should be able to look up the additional functions it uses, such as zip, xrange, and str.join. A * in front of an argument means that the remaining positional arguments are drawn from a sequence object. zip combined with a *-argument can be incredibly useful.

Here is a version that uses neither collections nor operator. This code is not as good as grouse's: it trades brevity for being a little easier to understand at first. In the real world, you'd be much more likely to see grouse's solution. That said:
word_to_count = {}
for line in open('myfile.txt'):
    words = line.split()
    for word in words:
        word_to_count[word] = word_to_count.get(word, 0) + 1

def GetCount(pair):
    return pair[1]

sorted_words = sorted(word_to_count.iteritems(), key=GetCount, reverse=True)
Some notes:

- We update word_to_count manually, using the dictionary's built-in get method. In English, that line means: "Get the current count for a given word, defaulting to 0 if we've never seen it before. Add one, and write the new value over the old value."
- We sort word_to_count after we build it. Why? Because dictionaries are unordered. This is a side effect of how they are implemented, and if you just print out their contents they'll come out in an order that's efficient for the machine but nonsensical to humans. sorted is the Python built-in that takes any iterable and returns a new, sorted list.
- iteritems gives us (word, count) pairs. We want to sort by the count, so we need some way to extract it from the pair. That's what GetCount does: it takes a (word, count) pair and returns its second element. This is also exactly what operator.itemgetter(1) and lambda i: i[1] do.
- Differences in capitalization and punctuation make the same word show up under separate keys in word_to_count. To fix this, you'd want to lowercase each word and strip off punctuation before the word_to_count[word] = ... line. I would make a helper function to do this. Again, optimizing for clarity, you might add the following to the top of your program:
import re

STRIP_PUNCTUATION_REGEX = re.compile(r'[".,;:-]')

def GetTokenFromWord(word):
    return STRIP_PUNCTUATION_REGEX.sub('', word.lower())
Regular expressions are a complicated subject. Briefly, re.compile builds a pattern. Calling a pattern's sub method lets you substitute every match of that pattern with a given replacement, which because we want to strip characters is the empty string. The contents of the pattern are the characters we want to consider matches. In this example, that's double quote, period, comma, semicolon, colon, and dash. The uppercase variable name is the conventional Python way of indicating a module-level value that shouldn't be changed after it's first set.
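As a rough sketch of how the helper could slot into the counting loop, here is the whole thing run on a couple of sample lines instead of a file. CountTokens and the sample lines are my own names for illustration and not part of the code above. Skipping empty tokens is also an assumption about what you'd want for words that are nothing but punctuation. It uses .items() rather than .iteritems() so it should run on both Python 2.7 and Python 3.

```python
import re

STRIP_PUNCTUATION_REGEX = re.compile(r'[".,;:-]')

def GetTokenFromWord(word):
    # Lowercase the word, then delete every character the pattern matches.
    return STRIP_PUNCTUATION_REGEX.sub('', word.lower())

def CountTokens(lines):
    # Hypothetical helper: the same counting loop as before,
    # but each word is normalized before it is counted.
    word_to_count = {}
    for line in lines:
        for word in line.split():
            token = GetTokenFromWord(word)
            if token:  # skip words that were nothing but punctuation
                word_to_count[token] = word_to_count.get(token, 0) + 1
    return word_to_count

# Made-up sample input standing in for the lines of myfile.txt.
sample_lines = ['The cat sat.', 'the "cat" -- sat; The end']
counts = CountTokens(sample_lines)
print(sorted(counts.items()))
```

With the normalization in place, "The", "the" and "The" all land on the single key "the", and "sat." and "sat;" both count toward "sat".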
posted by scruss at 11:59 AM on June 6, 2011