Document taxonomy
August 5, 2011 5:39 PM Subscribe
Looking for a document clustering application for a one-off search. Would prefer not to code my own.
I have a collection of several thousand documents that, ideally, I would like sorted into hierarchical trees to see which documents are most similar. For the most part I'm looking at documents which may have mutated slightly, e.g. drafts 1-6 may be scattered across a filesystem. I've looked at Carrot2, Rapidminer, and Solr, and it seems like there are some expensive legal discovery packages which do this.
The documents are a mix of typical office formats and PDFs, some of which may be hundreds of pages. Command-line or GUI is fine, as long as it produces a reasonable summary of document nearness. I don't mind coding a bit if there are reasonably simple libraries for this, but I don't have a whole week to spend on it. tm for R looks promising, but I am unsure how to implement the clustering after parsing the docs.
Are there any simple free/trial packages or addons to existing desktop search engines/CMSes that do this?
It might be a little involved, but I think you could get Lemur to do this.
posted by nml at 11:26 PM on August 5, 2011
You said you looked at Rapidminer - was it not doing the things you wanted, or just unhelpful about letting you know how to get there? In my experience it's pretty good for basic similarity calculations like this, especially in its 5.0 version, but tough to get to grips with in the beginning and lacking in decent documentation for its newer Text plugin. I'd strongly recommend the short video tutorials on 'Text Mining with Rapidminer' here if you want to give it another shot. (If you're looking for decent visualisations at the end of the process, though, Rapidminer is not your friend.)
posted by Catseye at 1:27 AM on August 6, 2011
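For the "code a bit" route, here is a rough sketch of the kind of basic similarity calculation discussed above. It is plain Python rather than tm, Rapidminer, or Lemur, and it assumes the office and PDF files have already been converted to plain text by some other tool. It compares documents by their shared three-word shingles (Jaccard similarity) and then merges the closest ones first (single-linkage agglomerative clustering), which tends to pull drafts of the same document together. The function names, file names, and sample texts are all made up for illustration.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def single_linkage(docs, k=3):
    """Agglomerative single-linkage clustering on Jaccard distance.

    docs maps a name to already-extracted plain text (an assumption of this
    sketch). Returns the merges in the order they happen, as tuples of
    (distance, left_members, right_members).
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    dist = {frozenset(pair): 1.0 - jaccard(sets[pair[0]], sets[pair[1]])
            for pair in combinations(sets, 2)}
    clusters = [(name,) for name in sorted(sets)]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance is that of the closest member pair.
        d, a, b = min(
            ((min(dist[frozenset((x, y))] for x in a for y in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((d, a, b))
    return merges


# Hypothetical sample data standing in for extracted document text.
docs = {
    "draft1.txt": "the quick brown fox jumps over the lazy dog near the river",
    "draft2.txt": "the quick brown fox jumps over the lazy dog near the old river",
    "memo.txt": "quarterly budget figures are due to finance by friday afternoon",
}
merges = single_linkage(docs)
for distance, left, right in merges:
    print(f"{distance:.2f}  {left} + {right}")
```

The merge list reads like a dendrogram from the bottom up: low-distance merges are probable drafts of one another, and merges near 1.0 are unrelated documents. Comparing every pair is quadratic, so for several thousand long documents this naive version may be slow; treat it as a starting point rather than a finished tool.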
posted by pla at 5:53 PM on August 5, 2011