Document taxonomy
August 5, 2011 5:39 PM
Looking for a document clustering application for a one-off search. Would prefer not to code my own.
I have a collection of several thousand documents that, ideally, I would like sorted into hierarchical trees to see which documents are most similar. For the most part I'm looking at documents which may have mutated slightly, e.g. drafts 1-6 may be scattered across a filesystem. I've looked at Carrot2, RapidMiner, and Solr, and there also seem to be some expensive legal discovery packages that do this.
The documents are a mix of typical office formats and PDFs, some of which may be hundreds of pages long. Command-line or GUI is fine, as long as it produces a reasonable summary of document nearness. I don't mind coding a bit if there are reasonably simple libraries for this, but I don't have a whole week to spend on it. tm for R looks promising, but I am unsure how to implement the clustering after parsing the docs.
Are there any simple free/trial packages or add-ons to existing desktop search engines/CMSes that do this?
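To make the "clustering after parsing" step concrete, here is a rough sketch of what I have in mind. It is not taken from any of the packages above. It assumes the documents have already been converted to plain text (the filenames and sample strings are made up), measures nearness as Jaccard similarity over three-word shingles, and builds a single-linkage merge order. The helper names `shingles`, `jaccard`, and `single_linkage` are hypothetical.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def single_linkage(docs, k=3):
    """Single-linkage agglomerative clustering over a {name: text} dict.

    Pairs are visited from most to least similar, and two clusters are
    merged the first time a pair links them (Kruskal-style union-find).
    Returns the merge history as (similarity, left_members, right_members).
    """
    names = list(docs)
    sets = {n: shingles(docs[n], k) for n in names}
    pairs = sorted(
        ((jaccard(sets[a], sets[b]), a, b) for a, b in combinations(names, 2)),
        key=lambda t: -t[0],
    )
    parent = {n: n for n in names}
    members = {n: [n] for n in names}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for sim, a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            merges.append((sim, sorted(members[ra]), sorted(members[rb])))
            parent[rb] = ra
            members[ra] = members[ra] + members.pop(rb)
    return merges


# Made-up sample data standing in for already-extracted document text.
docs = {
    "draft1.txt": "the quick brown fox jumps over the lazy dog near the river bank",
    "draft2.txt": "the quick brown fox jumps over the lazy dog near the old river bank",
    "memo.txt": "quarterly budget figures are due to the finance team by friday",
}

merges = single_linkage(docs)
for sim, left, right in merges:
    print(f"{sim:.2f}  {left} + {right}")
```

Read top to bottom, the merge list is the tree: the earliest merges are the closest drafts, and later merges join increasingly unrelated groups. Comparing every pair grows quadratically with the number of documents, so for several thousand long files this is probably only tolerable as a one-off.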
posted by benzenedream to computers & internet (3 answers total)
posted by pla at 5:53 PM on August 5, 2011