Text analysis software
December 12, 2010 12:36 PM
Suggestions for text analysis tools, please! Following a short discussion in the Metatalk infodump thread with cortex and iamkimiam, I realized that we'd appreciate suggestions for tools that can analyze text corpora and text files. Thanks!
We have different needs - the Mefi infodump has nearly a billion words in it, whereas I'm interested in analyzing transcripts and documents where 100k words would be very large - but we were asking ourselves the same questions. Areas of interest include frequency counts (I think we've kind of figured that out), collocation analyses, part-of-speech tagging and sorting, as well as experiences with data cleaning. A document clustering function would also be neat, at least for me. I can't speak for the others, but I'm definitely interested in tools that are free/cheap, easy to use, and run on the Mac. Personally, I'm currently using Nisus just to sort things, but it's very basic.
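For anyone wondering what frequency counts and collocation analysis look like in practice, here is a minimal Python sketch using only the standard library. It is not taken from any tool mentioned in this thread. The function names (`tokenize`, `frequency_counts`, `collocations`), the crude regex tokenizer, and the choice of pointwise mutual information (PMI) as the collocation score are all assumptions made for illustration; real tools may tokenize and score differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes as tokens.

    This is a deliberately crude tokenizer (an assumption for the sketch).
    """
    return re.findall(r"[a-z']+", text.lower())


def frequency_counts(tokens):
    """Return a Counter mapping each token to how often it occurs."""
    return Counter(tokens)


def collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Returns a list of ((word1, word2), pair_count, pmi), highest PMI first.
    Pairs seen fewer than min_count times are skipped, because PMI is
    unreliable for rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scored.append(((w1, w2), count, math.log2(p_pair / (p_w1 * p_w2))))

    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


if __name__ == "__main__":
    sample = "New York is big. New York is old. The city is big."
    toks = tokenize(sample)
    print(frequency_counts(toks).most_common(3))
    for pair, count, pmi in collocations(toks):
        print(pair, count, round(pmi, 2))
```

Something like this is fine for transcripts in the 100k-word range. For a corpus approaching a billion words you would probably want to stream the text and count in chunks rather than hold every token in memory.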
Best answer: Gary King has a selection of papers and software for doing automated content analysis here.
posted by proj at 3:09 PM on December 12, 2010
Best answer: There will be a ton of Perl modules for this. Hard to know where to start but obviously in the Text::* or Lingua::* hierarchies there will be lots of stuff.
Part Of Speech analysis I found here, for example.
posted by AmbroseChapel at 5:29 PM on December 12, 2010
Best answer: Lextutor has a series of tools you can use to analyze corpora.
posted by mukade at 2:55 AM on December 13, 2010
Response by poster: Belated thanks, everyone! This is all very useful and also new to me. I'm going to start with mukade's suggestion - Lextutor looks very interesting (and also plug-n-play) and I actually really like the Keyword tool as a way of getting a measure of 'aboutness' for a particular document. I've just fed it some interview transcripts and it has already come up with some interesting results. Cool!
posted by carter at 6:45 PM on December 13, 2010
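As a rough illustration of how a keyword ("aboutness") measure can work, the Python sketch below compares word frequencies in a target text against a reference corpus and ranks words by the log-likelihood statistic, which is one common keyness measure. This is not a description of how Lextutor's Keyword tool computes its results. The function names and the whitespace-split toy data are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.

    a: count of the word in the target text
    b: count of the word in the reference corpus
    c: total tokens in the target text
    d: total tokens in the reference corpus
    """
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_target)
    if b > 0:
        ll += b * math.log(b / expected_reference)
    return 2 * ll


def keywords(target_tokens, reference_tokens, top=10):
    """Return up to `top` (word, score) pairs that are unusually frequent
    in the target relative to the reference, highest score first.

    Both token lists are assumed to be non-empty.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c = len(target_tokens)
    d = len(reference_tokens)

    scored = []
    for word, a in target.items():
        b = reference[word]
        # Keep only words that are relatively MORE frequent in the target.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))

    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


if __name__ == "__main__":
    interview = "the budget the budget the deficit budget".split()
    general = "the cat the dog the bird a cat the tree".split()
    for word, score in keywords(interview, general):
        print(word, round(score, 2))
```

The choice of reference corpus matters a lot here: a word only counts as "key" relative to whatever you compare against, so interview transcripts scored against general English will surface different words than the same transcripts scored against each other.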
This thread is closed to new comments.
posted by nebulawindphone at 1:21 PM on December 12, 2010 [1 favorite]