Is there a tool to trawl a website and generate a frequency count for specific words I'm interested in?
August 6, 2010 5:43 AM

I'd like to find a tool to trawl a website forum (and all its posts) and generate a word frequency count for specific words I'm interested in (not the whole shooting match, just words I choose).

Do you know of a way to do this, without trawling through thousands of posts manually?

posted by 6am to Writing & Language (9 answers total)
This would be pretty easy to script in, say, Python -- use the built-in urllib to download the posts, lxml or BeautifulSoup to parse the HTML, and roll your own frequency counter.
posted by katrielalex at 6:27 AM on August 6, 2010
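
For anyone who wants to see roughly what that looks like, here is a minimal sketch rather than a finished tool. It assumes Python 3 and swaps in the standard library's html.parser for lxml or BeautifulSoup so that nothing extra needs installing. The names fetch_page, TextExtractor and count_chosen_words are made up for illustration, and decoding every page as UTF-8 is an assumption that may not suit a given forum.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen
import re


def fetch_page(url):
    """Download one page and return it as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_chosen_words(html, chosen):
    """Return a Counter of how often each chosen word appears in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    wanted = {word.lower() for word in chosen}
    # Treat runs of letters, digits and apostrophes as words (case-insensitive).
    words = re.findall(r"[a-z0-9']+", text)
    return Counter(word for word in words if word in wanted)


if __name__ == "__main__":
    sample = "<p>Cats and dogs. Cats!</p><script>var cats = 1;</script>"
    print(count_chosen_words(sample, ["cats", "dogs", "fish"]))
```

To cover a whole forum you would call fetch_page on each thread URL and add the resulting Counters together; how you collect those URLs depends on the site.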

Crawl the site with wget or httrack and then use whatever scripting language you're most familiar with to extract word count frequency -- it could be done in just a handful of lines of perl for example.
posted by Rhomboid at 6:32 AM on August 6, 2010

Download as much text as you want into a big file (as per previous answers - wget and then concatenating all the files might be good). Do this to strip all the HTML tags out:

cat yourfile.html | perl -pe 's/<>/ /g;' > yourfile.txt

(though if there are inline scripts, you'll be word-counting their code too. There are other ways of doing HTML-to-text conversions if you google 'em though). Then run the big file through this command to split into lines on whitespace and count 'em:

cat yourfile.txt | perl -lne 'print for split' | sort | uniq -c

If you only want one specific word, do this:

cat yourfile.txt | perl -lne 'print for split' | grep '^yourword$' | sort | uniq -c

Caveat: you need a real operating system. You know, with Perl and a shell and the usual tools that come with that. Cygwin will suffice in a pinch.
posted by polyglot at 6:53 AM on August 6, 2010

Damn HTML. First command should be this:

cat yourfile.html | perl -pe 's/<.+?>/ /g;' > yourfile.txt
posted by polyglot at 6:54 AM on August 6, 2010
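
The same strip-tags, split and count steps can also be sketched in Python for a whole folder of downloaded pages, such as one left behind by wget or httrack. This is only an illustration under some assumptions: the function name count_words_in_mirror is invented, the "*.html" file pattern is a guess (forum pages saved by a crawler may have other names), and the files are read as UTF-8.

```python
from collections import Counter
from pathlib import Path
import re

# Same idea as s/<.+?>/ /g, except DOTALL lets a tag span a line break.
TAG = re.compile(r"<.+?>", re.DOTALL)


def count_words_in_mirror(root, chosen, pattern="*.html"):
    """Sum exact-match counts of the chosen words over every matching file under root."""
    wanted = set(chosen)
    totals = Counter()
    for path in sorted(Path(root).rglob(pattern)):
        html = path.read_text(encoding="utf-8", errors="replace")
        text = TAG.sub(" ", html)
        # str.split() with no arguments splits on any run of whitespace.
        totals.update(word for word in text.split() if word in wanted)
    return totals
```

Like the grep version, this matches whole whitespace-separated tokens exactly, so "Word" and "word," are not counted as "word", and any inline script code is still counted as text.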

Response by poster: Thanks for the tips - only thing is what if I'm a total scripting virgin?! x
posted by 6am at 7:00 AM on August 6, 2010

From the command line (on a real operating system as specified by polyglot) the quickest way of getting just the text from a web page without the markup is probably to do:

lynx -dump ''

...on whatever URL you're interested in (requires having Lynx installed, obviously). You can also do that with a local HTML file if you've pulled it down first with wget or curl. The output from Lynx can be piped to Perl for word-splitting and counting.
posted by letourneau at 7:46 AM on August 6, 2010

Oh, and add the -nolist option to the lynx command to have it omit its footnote-style annotation of HTML links in the output dump (since that's irrelevant to your word-counting needs).
posted by letourneau at 7:48 AM on August 6, 2010

Best answer: You could get a rough idea with Google: just add '' to the query and see how many results it returns. It won't be entirely accurate if some pages use the word more than once or if some pages are not in the Google index.
posted by Lanark at 9:56 AM on August 6, 2010

Scripting virgins become scripting, erm, whores by giving it a go and discovering that they rather like it ;) Assuming you have Windows, install cygwin including Perl, fire up bash (it will install an icon for it) and type in the commands above. Might want to google for a quick bash tutorial on how to change directory, delete stuff, etc, to get you started - it really is very very easy.

Of course if you have further questions, there are endless references and tutorials on ye olde Internette, or you could feel free to MeMail me about this particular thing.
posted by polyglot at 12:40 AM on August 11, 2010
