Text analysis software
December 12, 2010 12:36 PM

Suggestions for text analysis tools, please! Following a short discussion in the Metatalk infodump thread with cortex and iamkimiam, I realized that we'd appreciate suggestions for tools that can analyze text corpora and text files. Thanks!

We have different needs: the Mefi infodump is approaching a billion words, whereas I'm interested in analyzing transcripts and documents where 100k words would be very large. Still, we were asking ourselves the same questions. Areas of interest include frequency counts (I think we've kind of figured that out), collocation analyses, part-of-speech tagging and sorting, as well as experiences with data cleaning. A document clustering function would also be neat, at least for me. I can't speak for the others, but I'm definitely interested in free/cheap, easy-to-use, and Mac-compatible tools. Personally, I'm currently using Nisus just to sort things, but it's very basic.
posted by carter to Writing & Language (5 answers total) 8 users marked this as a favorite
If you do any programming (or if you're serious enough about this sort of thing that you're willing to learn some very basic programming) you should look at NLTK, which is a Python library for doing corpus linguistics and text analytics. It's well-documented and very beginner-friendly, but you do have to write a little code — sometimes just a line or two — in order to do anything non-trivial with it, and I don't know if that works for you.
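To give a feel for how little code frequency counts and simple collocation counts take, here is a minimal sketch in plain Python using only the standard library. It is not NLTK itself; NLTK offers richer equivalents (for example `FreqDist` and its collocation finders). The sample sentence, the regex tokenizer, and the function names are all assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Count adjacent word pairs, a crude first pass at collocations."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text; swap in the contents of a transcript file.
sample = "The cat sat on the mat. The cat slept on the mat."
tokens = tokenize(sample)
freqs = word_frequencies(tokens)
bigrams = bigram_counts(tokens)

print(freqs.most_common(3))
print(bigrams.most_common(3))
```

Raw pair counts like these favor common function words, so a real collocation analysis would normally rank pairs with an association measure rather than by count alone.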
posted by nebulawindphone at 1:21 PM on December 12, 2010 [1 favorite]

Gary King has a selection of papers and software for doing automated content analysis here.
posted by proj at 3:09 PM on December 12, 2010

There will be a ton of Perl modules for this. Hard to know where to start but obviously in the Text::* or Lingua::* hierarchies there will be lots of stuff.

Part Of Speech analysis I found here, for example.
posted by AmbroseChapel at 5:29 PM on December 12, 2010

Lextutor has a series of tools you can use to analyze corpora.
posted by mukade at 2:55 AM on December 13, 2010

Belated thanks, everyone! This is all very useful and also new to me. I'm going to start with mukade's suggestion - Lextutor looks very interesting (and also plug-n-play) and I actually really like the Keyword tool as a way of getting a measure of 'aboutness' for a particular document. I've just fed it some interview transcripts and it has already come up with some interesting results. Cool!
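For anyone curious what a keyword-style 'aboutness' score might look like under the hood, here is a rough sketch in Python. It compares how often each word appears in a document against a reference text, using a smoothed ratio of relative frequencies. This is one simple approach and an assumption on my part, not necessarily what Lextutor computes; the two sample texts and the `keyness` function name are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def keyness(doc_tokens, ref_tokens):
    """Map each document word to the ratio of its relative frequency in the
    document to its relative frequency in the reference text.
    Add-one smoothing avoids dividing by zero for words missing from the reference."""
    doc, ref = Counter(doc_tokens), Counter(ref_tokens)
    vocab_size = len(set(doc) | set(ref))
    doc_total = len(doc_tokens) + vocab_size
    ref_total = len(ref_tokens) + vocab_size
    return {
        word: ((doc[word] + 1) / doc_total) / ((ref[word] + 1) / ref_total)
        for word in doc
    }


# Hypothetical texts: a transcript snippet and a general reference sample.
interview = tokenize("the software tags the corpus and the corpus grows")
reference = tokenize("the dog and the cat and the bird saw the tree")

scores = keyness(interview, reference)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

Words scoring well above 1 are over-represented in the document relative to the reference, which is roughly what a keyword list is trying to surface; with real data the reference text should be much larger than the document.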
posted by carter at 6:45 PM on December 13, 2010
