The (Facet-Searchable) Paperless Office?
June 4, 2010 7:05 AM
Are there any commercial products that are intended to (or could be set up to) mine content from unstructured office documents (think: memos, presentations) and populate fields in a database for faceted search, or is this still the domain of research?
I'm looking for software that, given a rough idea of the structure the documents probably have, does Named Entity Recognition / matching against named entities in a known list, classifies documents based on training data, and extracts chunks of text that match certain patterns and inserts them into a database as fields for future faceted search. The documents need to remain editable (so that, e.g., sections of a PowerPoint can be copied for re-use once located).
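To make that concrete, here is a minimal sketch of the matching and extraction parts of such a pipeline, using only Python's standard `re` and `sqlite3` modules. Everything named in it is hypothetical and for illustration only: the `KNOWN_ENTITIES` list, the `FIELD_PATTERNS` regexes, the `facets` table layout, and the helper functions. It assumes the text has already been pulled out of the office file by some other tool, and it does not cover classification from training data. The database stores only a path to the original file, so the document itself stays editable.

```python
import re
import sqlite3

# Hypothetical known-entity list and field patterns; real ones would come
# from your own organization's data.
KNOWN_ENTITIES = {"Acme Corp", "Globex", "Jane Smith"}
FIELD_PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "project_code": re.compile(r"\bPRJ-\d{3,}\b"),
}


def extract_facets(text):
    """Return a sorted list of (facet, value) pairs found in the text."""
    facets = set()
    lowered = text.lower()
    # Case-insensitive substring match against the known-entity list.
    for entity in KNOWN_ENTITIES:
        if entity.lower() in lowered:
            facets.add(("entity", entity))
    # Pull out every chunk of text that matches a field pattern.
    for name, pattern in FIELD_PATTERNS.items():
        for match in pattern.findall(text):
            facets.add((name, match))
    return sorted(facets)


def index_document(conn, path, text):
    """Store facet rows that point back at the original, still-editable file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facets (path TEXT, facet TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO facets VALUES (?, ?, ?)",
        [(path, facet, value) for facet, value in extract_facets(text)],
    )


def search(conn, facet, value):
    """Return the paths of documents that carry the given facet value."""
    rows = conn.execute(
        "SELECT DISTINCT path FROM facets WHERE facet = ? AND value = ? "
        "ORDER BY path",
        (facet, value),
    )
    return [row[0] for row in rows]
```

With an in-memory database (`sqlite3.connect(":memory:")`), calling `index_document(conn, "memos/q2.docx", text)` on each document and then `search(conn, "project_code", "PRJ-1042")` would return the paths of the matching files, which can then be opened and edited as usual.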
I know of a lot of full-text indexing/retrieval software (IMR Alchemy, etc.), but most products in that space are archival (e.g., they don't work with editable documents), and as far as I can tell most can't do automated indexing for faceted search.
I work with folks who do research in areas surrounding IR and document analysis, but I'm not aware of any off-the-shelf products doing this kind of work. I suspect it's kind of a holy grail in IR and if anything exists, it's in the lab and not entirely stable. Are there any commercial products doing the build-a-faceted-index-based-on-content-analysis stuff, though? Assume cost is no object.
posted by Alterscape to computers & internet (3 answers total)
posted by marginaliana at 7:30 AM on June 4, 2010