The (Facet-Searchable) Paperless Office?
June 4, 2010 7:05 AM

Are there any commercial products intended to (or could be set up to) mine content from unstructured office documents (think: memos, presentations) and populate fields in a database for faceted search, or is this still the domain of research?

I'm looking for software that, given a vague idea of the structure the documents probably have, does named entity recognition (or matching against named entities in a known list), classifies documents based on training data, extracts chunks of text that match certain patterns, and inserts them into a database as fields for future faceted search, that kind of thing. The documents need to remain editable (so that, e.g., sections of a PowerPoint can be copied for re-use once located).
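A minimal sketch of the kind of pipeline described above, assuming plain text has already been pulled out of each document. The entity list, the date pattern, the table layout, and every function name here are hypothetical stand-ins; a real system would need document parsers and a trained classifier in their place. Only file paths go into the index, so the documents themselves stay editable.

```python
import re
import sqlite3

# Hypothetical known-entity list (a "gazetteer"), grouped by facet name.
KNOWN_ENTITIES = {
    "project": ["Apollo", "Gemini"],
    "department": ["Finance", "Engineering"],
}

# Hypothetical pattern for one structured field: a date such as 2010-06-04.
DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def extract_facets(text):
    """Return a sorted list of (facet, value) pairs found in the text."""
    facets = set()
    for facet, names in KNOWN_ENTITIES.items():
        for name in names:
            # Whole-word, case-insensitive match against the known list.
            if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
                facets.add((facet, name))
    for match in DATE_PATTERN.finditer(text):
        facets.add(("date", match.group(1)))
    return sorted(facets)


def index_document(conn, path, text):
    """Store one row per (document, facet, value); the file is untouched."""
    conn.executemany(
        "INSERT INTO facets (path, facet, value) VALUES (?, ?, ?)",
        [(path, facet, value) for facet, value in extract_facets(text)],
    )


def search(conn, **wanted):
    """Return paths of documents that carry every requested facet value."""
    paths = None
    for facet, value in wanted.items():
        rows = conn.execute(
            "SELECT DISTINCT path FROM facets WHERE facet = ? AND value = ?",
            (facet, value),
        )
        found = {row[0] for row in rows}
        paths = found if paths is None else paths & found
    return sorted(paths or [])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facets (path TEXT, facet TEXT, value TEXT)")
index_document(conn, "memo1.doc", "Finance memo on Apollo, dated 2010-06-04.")
index_document(conn, "deck2.ppt", "Engineering update for the Apollo project.")
print(search(conn, project="Apollo", department="Finance"))  # ['memo1.doc']
```

Each keyword argument to `search` narrows the result set, which is the basic behaviour a faceted interface needs.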

I know of a lot of full-text indexing/retrieval software (IMR Alchemy, etc.), but most things in that space are archival (e.g., they don't work with editable documents), and as far as I can tell most can't do automated indexing for faceted search.

I work with folks who do research in areas surrounding IR and document analysis, but I'm not aware of any off-the-shelf products doing this kind of work. I suspect it's kind of a holy grail in IR and if anything exists, it's in the lab and not entirely stable. Are there any commercial products doing the build-a-faceted-index-based-on-content-analysis stuff, though? Assume cost is no object.
posted by Alterscape to Computers & Internet (3 answers total)
I'm not entirely sure this is what you're looking for, but how about something like WestKM?
posted by marginaliana at 7:30 AM on June 4, 2010

Yes -- something like WestKM, but for a non-law domain (I can't mention the specific domain).
posted by Alterscape at 9:36 AM on June 4, 2010

(Forgive me in advance if this sounds a bit ranty. My opinion comes from having programmed lots of IR tasks, and storage/search engines.)

1.) If cost really is no object, hire people to index it properly, like WestLaw and friends do :) Check for consistency in results (Amazon Mechanical Turk style) to reduce errors. I suspect that money isn't *that* infinite!

2.) If you just want search on it, get a Google search box.

3.) If you really want Bayesian search with classifiers and the like, consider something like Calais. There are lots of academic projects as well.

4.) Stack Overflow has some good discussions of algorithms that may or may not have links to commercial offerings. YMMV.
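The consistency check in point 1 could be sketched roughly like this: several indexers label the same document, and a label is accepted only when enough of them agree. The function name, the two-thirds threshold, and the sample labels are illustrative assumptions, not any particular service's API.

```python
from collections import Counter


def consensus_label(labels, min_agreement=2 / 3):
    """Return the majority label, or None if agreement is too low."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None


# Hypothetical labels from three human indexers per document.
votes = {
    "memo1.doc": ["contract", "contract", "invoice"],
    "deck2.ppt": ["contract", "invoice", "report"],
}
for path, labels in votes.items():
    result = consensus_label(labels)
    print(path, result if result else "needs review")
```

Documents that fall below the threshold go back for another pass instead of being indexed with a guess.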

From your description, it sounds like you're getting into algorithms that are near AI-level (which is common in information retrieval tasks). The closer you get to needing AI, the less sense it makes for computers to actually do it, and the cheaper an army of indexers looks.

Best of luck!
posted by gregglind at 10:01 AM on June 4, 2010
