Document taxonomy
August 5, 2011 5:39 PM

Looking for a document clustering application for a one-off search. Would prefer not to code my own.

I have a collection of several thousand documents that I would ideally like sorted into hierarchical trees to show which documents are most similar. For the most part I'm looking at documents that may have mutated slightly, e.g. drafts 1-6 scattered across a filesystem. I've looked at Carrot2, RapidMiner, and Solr, and there seem to be some expensive legal discovery packages that do this.
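For the "drafts 1-6" case specifically, one common approach is to compare documents by their overlapping word sequences (shingles) and score each pair with Jaccard similarity. The following is only a minimal sketch in plain Python, assuming the text has already been extracted from the office/PDF files into strings; the function names, the 3-word shingle size, and the 0.5 threshold are illustrative choices, not anything taken from the tools mentioned above.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """List (name_a, name_b, score) for every pair scoring at least threshold.

    docs maps a document name to its extracted plain text (an assumption:
    text extraction from office formats and PDFs happens elsewhere).
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            pairs.append((x, y, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])
```

Calling `near_duplicates({"draft1": text1, "draft2": text2, ...})` returns the pairs most likely to be revisions of one another. Comparing every pair is quadratic in the number of documents, so for several thousand long files this is slow but probably still feasible for a one-off run.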

The documents are a mix of typical office formats and PDFs, some of which may be hundreds of pages long. Command-line or GUI is fine, as long as it produces a reasonable summary of document nearness. I don't mind coding a bit if there are reasonably simple libraries for this, but I don't have a whole week to spend on it. tm for R looks promising, but I am unsure how to implement the clustering after parsing the docs.
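To make the "clustering after parsing" step concrete, here is a rough sketch of one standard recipe: weight each document's terms with TF-IDF, compare documents by cosine similarity, and merge the closest clusters repeatedly (average-linkage agglomerative clustering) to get a hierarchy. It is written in plain Python rather than R's tm, and the function names and the smoothed IDF formula are assumptions made for illustration, not the way any particular package does it.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a sparse TF-IDF vector (dict of term -> weight)."""
    counts = {name: Counter(re.findall(r"[a-z0-9]+", text.lower()))
              for name, text in docs.items()}
    n = len(counts)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts.values() for term in c)
    # Smoothed IDF, so terms present in every document keep a small weight.
    return {name: {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
                   for t, tf in c.items()}
            for name, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def cluster(docs):
    """Average-linkage agglomerative clustering.

    Returns the merges in order as (cluster_a, cluster_b, similarity),
    where each cluster is a tuple of document names. Read top to bottom,
    the list describes the tree from the tightest groups outward.
    """
    vecs = tfidf_vectors(docs)
    names = sorted(vecs)
    sim = {(a, b): cosine(vecs[a], vecs[b]) for a in names for b in names}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i], clusters[j]
                # Average pairwise similarity between the two clusters.
                s = sum(sim[a, b] for a in ci for b in cj) / (len(ci) * len(cj))
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

This naive loop is fine for a few hundred documents but gets slow well before several thousand. At that scale an optimized routine such as SciPy's `scipy.cluster.hierarchy.linkage` would be the more practical choice, with the TF-IDF step kept the same in spirit.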

Are there any simple free/trial packages or addons to existing desktop search engines/CMSes that do this?
posted by benzenedream to Computers & Internet (3 answers total) 2 users marked this as a favorite
Thumbs Plus has a pretty decent "find similar images" feature. Probably a bit too imprecise for what you want, but it will compare an entire disk of images and does a pretty good job of finding ones that "look" similar (for example, I scan-and-shred just about everything, and it can dead-on find all my statements from bank-X).
posted by pla at 5:53 PM on August 5, 2011

It might be a little involved, but I think you could get Lemur to do this.
posted by nml at 11:26 PM on August 5, 2011

You said you looked at RapidMiner - was it not doing the things you wanted, or just unhelpful about letting you know how to get there? In my experience it's pretty good for basic similarity calculations like this, especially in its 5.0 version, but tough to get to grips with in the beginning and lacking in decent documentation for its newer Text plugin. I'd strongly recommend the short video tutorials on 'Text Mining with Rapidminer' here if you want to give it another shot. (If you're looking for decent visualisations at the end of the process, though, RapidMiner is not your friend.)
posted by Catseye at 1:27 AM on August 6, 2011
