Finding number of unique words in a document using Python
June 6, 2011 11:53 AM
Newbie Python advice required: counting unique keywords in a document.
posted by StephenF to computers & internet (16 answers total) 5 users marked this as a favorite
I'm almost completely new to Python, and have been trying to write a programme to show the count of each unique word in a document. So what I want at the end is an output that tells me there are 10 uses of 'and', 5 uses of 'it', 23 uses of 'of' and so on.
The way I've approached it to this point is:
- Read the text file using open and read
- Split the text using text.split()
- Convert everything to lowercase using text.lower()
- Create a set of the individual words which automatically filters so that it only contains unique words
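The steps above can be sketched roughly as follows. This is only a minimal sketch: the sample string stands in for the file contents, the file name in the comment is hypothetical, and the variable names are illustrative. Lowercasing is done before splitting here, since lower() is a string method and split() returns a list.

```python
# Sample text standing in for the file contents. With a real file this would be
# something like: text = open("document.txt").read()
# ("document.txt" is a hypothetical file name.)
text = "The cat and the dog and the bird"

# Lowercase the whole string first, then split it on whitespace into a list.
words = text.lower().split()

# A set keeps exactly one copy of each distinct word.
unique_words = set(words)

# The length of the set is the number of distinct words, not a per-word count.
print(len(unique_words))
```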
The length of the set gives me the total number of uniques, but that's not what I'm after. If I could iterate through the records in the set, I could count how many times each unique word appears in the text when it has been split into words from the original, but if it is possible to iterate through a set in that way in Python, I have not been able to figure out how to do it. (I feel that part of my difficulty is that I am 'cheating' by using the set functionality to filter for uniques, and I should be doing this for myself in some way, presumably with regular expressions(?))
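For what it's worth, a Python set can be iterated with an ordinary for loop, and a list has a count() method, so the idea described above might look something like the sketch below. The function name count_with_set and the sample text are made up for illustration. Note that count() rescans the whole list once per unique word, so this gets slow on large documents.

```python
def count_with_set(words):
    """Return (word, count) pairs by iterating over the set of unique words."""
    unique_words = set(words)
    results = []
    # sorted() is used only to get a stable, alphabetical order;
    # a bare "for word in unique_words" also works, in arbitrary order.
    for word in sorted(unique_words):
        # list.count() scans the full word list for each unique word.
        results.append((word, words.count(word)))
    return results


words = "the cat and the dog and the bird".split()
for word, n in count_with_set(words):
    print(word, n)
```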
From some research, I believe that another way to approach this problem would be using the dictionary in Python and a hash table, though I don't fully understand what that means.
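As a rough illustration of the dictionary approach: a Python dict is itself implemented as a hash table, so using one means mapping each word to its running count. The sketch below assumes whitespace splitting and lowercasing as in the steps listed earlier; the function name count_words and the sample sentence are invented for the example. The standard library's collections.Counter does the same job in one call and is shown for comparison.

```python
from collections import Counter


def count_words(text):
    """Map each lowercased word in text to the number of times it appears."""
    counts = {}
    for word in text.lower().split():
        # get() returns 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "It was the best of times it was the worst of times"
counts = count_words(sample)

# Print the words alphabetically with their counts.
for word, n in sorted(counts.items()):
    print(word, n)

# Counter builds the same mapping directly from the list of words.
print(Counter(sample.lower().split()) == counts)
```

Only one pass over the words is needed this way, and no separate set is required, because the dictionary's keys are already the unique words.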
My question is then: is there a way to get to what I want using what I have done so far, or do I need to start again by learning about some other technique (e.g. hash tables)?
Thanks as ever, MeFi.