Generating a Tag Cloud From Really Large Text Files
October 5, 2008 2:13 PM

I want to generate tag clouds for really large text files (the largest is 85 MB). No online service that I've found will support a file this big. Any suggestions?

I grabbed a copy of the website of each of the four major national Canadian political parties. I want to generate a tag cloud for each party, showing the terms they use most often on their sites.

I had a sysadmin friend of mine (I'm so not a programmer) use some regular expression magic to merge each site into a single file. Thus, I have four very large text files. I considered things like Wordle or Many Eyes, but they have a much smaller maximum file size than I need. What should I do?

One alternative: get my friend or somebody to produce a list of words for each site and the number of times they appear. Then I create a small file with the right number of each of, say, the top 100 words, and use Wordle on that. Any better suggestions? Or, you know, help?
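That alternative can be sketched in a few lines of Python (a sketch only, with placeholder file names: count the words, take the top 100, and write a small file that repeats each word in proportion to its frequency, so a size-limited service like Wordle can render an equivalent cloud):

```python
# Sketch: shrink a huge text file into a small, Wordle-sized one.
# Repeats each of the top words in proportion to its count, with all
# counts divided by `divisor` to keep the output file small.
import re
from collections import Counter

def shrink(in_path, out_path, top_n=100, divisor=10):
    counts = Counter()
    with open(in_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(re.findall(r"[a-z']+", line.lower()))
    with open(out_path, "w", encoding="utf-8") as out:
        for word, n in counts.most_common(top_n):
            # Write each word once per `divisor` occurrences, at least once.
            out.write((word + "\n") * max(1, n // divisor))
```

Reading line by line keeps memory flat even on an 85 MB file, since only the word counts are held in RAM.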
posted by dbarefoot to Computers & Internet (9 answers total) 4 users marked this as a favorite
 
I would ask him to redo it and make each page its own file, so you have many small files.
posted by jesirose at 2:26 PM on October 5, 2008


Use the alternative method. What operating system is running on the machine that has the files? If it happens to be Linux or a Mac, I bet we can come up with a script you can type in that shrinks your file for you.
posted by demiurge at 2:29 PM on October 5, 2008


Jesirose: I'm not sure how that will help. I still need one file to feed to any of these online services to generate one tag cloud.

Demiurge: I could run the script on a Mac.
posted by dbarefoot at 3:15 PM on October 5, 2008


You say you're not a programmer -- are you interested in doing any programming? This would be short work for a Perl script (I whipped up a ~15-line proof-of-concept that takes about 5 minutes to parse a sample 50 MB file; MeMail me if you're interested).

Another alternative would be to take the data and replace all of the spaces with newline characters, then import the new file (which would have one word per line) into an Access database, then do a simple aggregate query against the results (e.g., "SELECT word, COUNT(*) FROM MyTable GROUP BY word").
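The same split-and-aggregate idea works without Access, too. Here's a sketch using Python's built-in sqlite3 module (table and column names are just illustrative), which runs fine on a Mac:

```python
# Sketch: the one-word-per-row + GROUP BY approach, using an
# in-memory SQLite database instead of Access.
import re
import sqlite3

def count_words_sql(path):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE MyTable (word TEXT)")
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            # Each word becomes its own row, like replacing spaces
            # with newlines before importing.
            con.executemany(
                "INSERT INTO MyTable VALUES (?)",
                ((w,) for w in re.findall(r"[a-z']+", line.lower())))
    # The aggregate query from above, with most frequent words first.
    return con.execute(
        "SELECT word, COUNT(*) AS n FROM MyTable "
        "GROUP BY word ORDER BY n DESC").fetchall()
```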
posted by Doofus Magoo at 3:36 PM on October 5, 2008


There's a thread at BoingBoing that discusses a similar need, with many suggested solutions.
posted by chazlarson at 4:24 PM on October 5, 2008


I spent way too much time on this, but try using the files here.

You'll have to download the stop words file (which I got originally from http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words ). This is used by the script to eliminate common words (she, am, who, etc.). You can tailor it to your needs if you wish.

Then run the script on the file like this:

python txtshrink.py your_data.txt

It should print out a bunch of *s while it's processing and, when it's done, write an output.txt file with a reduced word list, along with some statistics. The output file is a text file containing the major words at the right relative frequency, but with all counts divided by 10. Given the size of your files, you may need to make that divisor 100.

I based my script on one Steve Holden wrote.
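For reference, the stop-word step described above might look roughly like this (a sketch only; the actual txtshrink.py may differ): load the stop words file, then drop those words before counting.

```python
# Sketch of the stop-word filtering described above: common words
# ("she", "am", "who", etc.) from the stop words file are removed
# before the frequencies are tallied.
import re
from collections import Counter

def count_without_stopwords(text_path, stop_path):
    with open(stop_path, encoding="utf-8") as f:
        stops = set(w.strip().lower() for w in f if w.strip())
    counts = Counter()
    with open(text_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(w for w in re.findall(r"[a-z']+", line.lower())
                          if w not in stops)
    return counts
```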
posted by demiurge at 6:25 PM on October 5, 2008


demiurge: Awesome, thanks very much. I'll give that a try and report back on how it turns out.
posted by dbarefoot at 7:29 PM on October 5, 2008


TextArc might work? However, it will generate a dense cloud with every word of the text, the more commonly used words appearing larger.
posted by cogat at 10:05 PM on October 5, 2008


I said I'd follow up when the tag clouds are done, so here they are.
posted by dbarefoot at 12:37 PM on October 9, 2008

