Text analysis software
December 12, 2010 12:36 PM

Suggestions for text analysis tools, please! Following a short discussion in the Metatalk infodump thread with cortex and iamkimiam, I realized that we'd appreciate suggestions for tools that can analyze text corpora and text files. Thanks!

We have different needs: the Mefi infodump has close to a billion words in it, whereas I'm interested in analyzing transcripts and documents where 100k words would be very large. Still, we were asking ourselves the same questions. Areas of interest include frequency counts (I think we've more or less figured that out), collocation analyses, part-of-speech tagging and sorting, as well as experiences with data cleaning. A document clustering function would also be neat, at least for me. I can't speak for the others, but I'm definitely interested in tools that are free or cheap, easy to use, and run on a Mac. Personally, I'm currently using Nisus just to sort things, but it's very basic.
posted by carter to Writing & Language (5 answers total) 8 users marked this as a favorite
 
Best answer: If you do any programming (or if you're serious enough about this sort of thing that you're willing to learn some very basic programming) you should look at NLTK, which is a Python library for doing corpus linguistics and text analytics. It's well-documented and very beginner-friendly, but you do have to write a little code — sometimes just a line or two — in order to do anything non-trivial with it, and I don't know if that works for you.
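
To give a sense of how little code "a line or two" can mean, here is a rough sketch of frequency counts and adjacent-word (bigram) counts, which are the raw material for a collocation analysis. It deliberately uses only the Python standard library so it runs without installing anything; NLTK ships its own versions of these ideas (for example `nltk.FreqDist` and `nltk.bigrams`), plus real tokenizers and taggers. The function names, the sample text, and the very crude regex tokenizer are all made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes as tokens.

    This is a crude stand-in for a real tokenizer (an assumption for the
    sketch); it ignores numbers, hyphenation, and non-ASCII letters.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Count each pair of adjacent tokens (candidate collocations)."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text; in practice you would read a transcript file.
sample = "The cat sat on the mat. The cat slept on the mat."

tokens = tokenize(sample)
freq = word_frequencies(tokens)
bigrams = bigram_counts(tokens)

print(freq.most_common(3))     # the most frequent words
print(bigrams.most_common(3))  # the most frequent adjacent word pairs
```

Raw bigram counts will be dominated by pairs of common function words, so a real collocation analysis would normally add a stop-word filter or an association measure on top of this.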
posted by nebulawindphone at 1:21 PM on December 12, 2010 [1 favorite]


Best answer: Gary King has a selection of papers and software for doing automated content analysis here.
posted by proj at 3:09 PM on December 12, 2010


Best answer: There are a ton of Perl modules for this. It's hard to know where to start, but the Text::* and Lingua::* hierarchies obviously have lots of relevant stuff.

I found part-of-speech analysis here, for example.
posted by AmbroseChapel at 5:29 PM on December 12, 2010


Best answer: Lextutor has a series of tools you can use to analyze corpora.
posted by mukade at 2:55 AM on December 13, 2010


Response by poster: Belated thanks, everyone! This is all very useful and also new to me. I'm going to start with mukade's suggestion - Lextutor looks very interesting (and also plug-n-play) and I actually really like the Keyword tool as a way of getting a measure of 'aboutness' for a particular document. I've just fed it some interview transcripts and it has already come up with some interesting results. Cool!
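
For anyone curious what a keyword or "aboutness" measure might look like under the hood, here is a minimal sketch. To be clear, this is not Lextutor's actual method, which I haven't checked; it assumes the simplest possible approach, scoring each word by how much more often it occurs in your document than in a reference text, with add-one smoothing so that words missing from the reference don't cause a division by zero. The function names, the `min_count` parameter, and the sample texts are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def keyness(doc_text, ref_text, min_count=2):
    """Rank words by the ratio of their smoothed relative frequency in the
    document to their smoothed relative frequency in the reference text.

    A ratio above 1 means the word is relatively more common in the document.
    Words seen fewer than `min_count` times in the document are skipped.
    """
    doc = Counter(tokenize(doc_text))
    ref = Counter(tokenize(ref_text))
    doc_total = sum(doc.values())
    ref_total = sum(ref.values())
    vocab_size = len(set(doc) | set(ref))

    scores = {}
    for word, count in doc.items():
        if count < min_count:
            continue
        # Add-one smoothing over the combined vocabulary.
        doc_rate = (count + 1) / (doc_total + vocab_size)
        ref_rate = (ref[word] + 1) / (ref_total + vocab_size)
        scores[word] = doc_rate / ref_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical texts; a real reference corpus would be far larger.
document = "the interview covered housing housing housing and the rent"
reference = "the weather and the news and the sports"

ranked = keyness(document, reference)
for word, score in ranked:
    print(f"{word}: {score:.2f}")
```

With a tiny reference text like this the scores are only illustrative; the measure becomes meaningful when the reference is a large, general-purpose corpus.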
posted by carter at 6:45 PM on December 13, 2010

