You say "validation," they say "verification"....
January 15, 2010 2:25 PM

How can I parse several largish (~6 MB) text documents to produce a common index of keywords and phrases? I need something that will recognize phrases as well as single keywords, kind of like Amazon's Statistically Improbable Phrases.

I am looking to reconcile terminology in the user requirements documents of a dozen different user organizations who are stakeholders for the same large system under development.

I need to reconcile and document that terminology and square it with the development team's understanding. There are about 18 or so documents and I would like a nifty software thing that would parse them (after a reasonable amount of preprocessing, if necessary) and spit out an index of keywords and phrases that are candidates for "jargon" that needs to be defined and/or reconciled (and the documents/user organizations that use them). Any help?

Oh yeah. And, of course, I have no budget for tools.
posted by cross_impact to Computers & Internet (5 answers total)
One simple thing you could do is check the document for introductions of initialisms and acronyms. For example, processing the sentence "The Federal Deposit Insurance Corporation (FDIC) is a United States government corporation created by the Glass-Steagall Act of 1933" should associate FDIC with "Federal Deposit Insurance Corporation." You could do this by keeping track of the N previous words in a document, and whenever you come across a string of capitalized characters enclosed in parentheses, check to see if the first letters of the preceding words match the characters enclosed in parentheses.

That will get you something, if that sort of convention is used in these documents.
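
A rough sketch of that check in Python, in case it helps. The function name `find_acronyms` and the 200-character lookback window are my own assumptions, not anything standard. It is strict: every letter in the parentheses has to match the first letter of one preceding word, so it will miss things like "Department of Defense (DoD)".

```python
import re

def find_acronyms(text, lookback_chars=200):
    """Return {acronym: expansion} for patterns like 'Some Long Name (SLN)'.

    Hypothetical helper: assumes each acronym letter maps to exactly one
    preceding word, with no skipped words such as 'of' or 'the'.
    """
    found = {}
    # A run of two or more capital letters enclosed in parentheses.
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Only look at a short window of text before the parenthesis.
        window = text[max(0, match.start() - lookback_chars):match.start()]
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", window)
        candidate = words[-len(acronym):]
        # Accept only if the first letters of the preceding words spell the acronym.
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter for word, letter in zip(candidate, acronym)
        ):
            found[acronym] = " ".join(candidate)
    return found

sentence = (
    "The Federal Deposit Insurance Corporation (FDIC) is a United States "
    "government corporation created by the Glass-Steagall Act of 1933"
)
print(find_acronyms(sentence))
```

Run it over each document separately and you can compare which organizations expand the same initialism differently.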
posted by seliopou at 2:59 PM on January 15, 2010

This might be a step in the right direction:
It uses the Yahoo! API's ContextualAnalysisService to extract key phrases from text you paste into the field there. That's only the first part of what you're looking for, but it's the best part.
posted by xueexueg at 3:45 PM on January 15, 2010

Try looking for some free Bayesian filtering programs? Maybe you can get a free and open source email spam filter to help you out?
posted by mhh5 at 4:15 PM on January 15, 2010

Are you manually defining the terms and phrases you are after? If so, grep was built for this.

It was originally a Unix thing and is now mainly a Linux thing, but I'm going to take a wild guess that you'll need a Windows variant.
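
If grep isn't handy, the same idea (search every document for a list of terms you define by hand) can be sketched in a few lines of Python. The name `term_usage`, the sample file names, and the sample text below are made up for illustration; the matching is a case-insensitive whole-word search.

```python
import re
from collections import defaultdict

def term_usage(documents, terms):
    """Map each term to {document name: number of occurrences}.

    `documents` is a dict of name -> full text; `terms` is a list of
    manually chosen words or phrases. Both are assumed inputs.
    """
    usage = defaultdict(dict)
    for term in terms:
        # Whole-word, case-insensitive match, with the term treated literally.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for name, text in documents.items():
            count = len(pattern.findall(text))
            if count:
                usage[term][name] = count
    return dict(usage)

docs = {
    "org_a.txt": "Validation of the system. System validation is required.",
    "org_b.txt": "Verification precedes acceptance.",
}
report = term_usage(docs, ["validation", "verification"])
for term, hits in report.items():
    print(term, hits)
```

That gives you the "which organizations use which term" table, but only for terms you already know to look for.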
posted by Mr. Anthropomorphism at 8:03 PM on January 15, 2010

Taking a corpus linguistics approach, you could probably do something with a concordancer; there are a few decent free ones available for various platforms.

You'd have to read the documentation of your favorite one to determine the syntax for asking it the right questions, but I'd think you could get some good results this way.
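
For a sense of what a concordancer gives you, here is a bare-bones keyword-in-context sketch in Python. It is not how any particular concordancer works; the function name `concordance` and the `width` parameter are assumptions for the example.

```python
import re

def concordance(text, keyword, width=30):
    """Return one line per occurrence of `keyword`, with `width` characters
    of context on each side (a simple keyword-in-context listing)."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        # Flatten newlines so each hit prints on a single line.
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines

sample = "The system shall support validation of all inputs."
hits = concordance(sample, "validation", width=10)
for line in hits:
    print(line)
```

Reading the same term in context across documents is a quick way to spot where two organizations use one word for different things.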
posted by treblemaker at 9:29 PM on January 15, 2010
