How to extract relevant terms from text?
March 1, 2007 2:44 AM
How can I extract relevant terms from big chunks of text in a similar style to the Yahoo Term Extraction API?
I've been trying to find a way to extract relevant terms from *lots* of big chunks of text in a similar way to the Yahoo Term Extraction API. Half my problem is that I can't seem to come up with the right vocabulary to plug into search engines.
The Yahoo API isn't an option since there's a lot of text, it's commercial and I can't add the 'query' term to seed it.
I'd love to be able to end up with a bunch of interesting words and some sort of relevancy score or weight.
Best answer: I think you might be looking for noun phrases. There's a Perl module called Lingua::EN::Tagger that might help you do what you want.
Other than that, you're after natural language processing software that's going to cost you. You have to put the hard yards in yourself if you want it for free/cheap.
posted by singingfish at 3:58 AM on March 1, 2007
You might try searching for "tfidf".
TF-IDF stands for "term frequency, inverse document frequency". It can be used to find the quote-unquote important words in a document. It's a simple and well-worn algorithm for finding words that occur frequently in a given document but are rare in the larger corpus.
I'm not sure it will help, but the Wikipedia page may be worth reading to see whether it might.
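For what it's worth, here is a minimal sketch of that idea in plain Python, not a definitive implementation. The function names (`tokenize`, `tfidf_scores`, `top_terms`), the crude regex tokenizer and the three toy documents are all made up for illustration, and the weighting shown (relative term frequency times the log of inverse document frequency) is just one common variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def tfidf_scores(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty text
        scores.append({
            # tf = share of this document; idf = log(N / document frequency)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(scores, k=5):
    """Return the k highest-weighted (term, score) pairs for one document."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat with the other cat",
    "the dog chased the ball across the park",
    "the bird sang in the park",
]
all_scores = tfidf_scores(docs)
best = top_terms(all_scores[0], k=1)
```

A word like "the" shows up in every document, so its weight drops to zero, while a word that is repeated in one document and absent from the others floats to the top. The per-term score is the sort of relevancy weight the question asks for.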
posted by zpousman at 8:36 AM on March 1, 2007
If you simply want to extract (frequent) words and phrases of certain types (probably nouns and some kinds of noun phrases), you could use an NLP toolkit like nltk-lite to tokenize, POS-tag and chunk the text, then filter the results. This shouldn't be a lot of work.
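As a rough sketch of the "filter the results" step only, assuming the text has already been tokenized and POS-tagged by some toolkit: the code below takes (word, tag) pairs, collects runs of adjectives and nouns that end in a noun, and counts them. The Penn Treebank-style tags, the hand-tagged sample sentence and the `noun_phrases` helper are assumptions for illustration, not any toolkit's actual API.

```python
from collections import Counter


def noun_phrases(tagged_tokens):
    """Yield maximal adjective/noun runs that end in a noun."""
    phrase = []
    # A sentinel pair flushes whatever run is still open at the end.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag.startswith("NN") or tag.startswith("JJ"):
            phrase.append((word, tag))
            continue
        # The run has ended: drop trailing adjectives so it ends on a noun.
        while phrase and not phrase[-1][1].startswith("NN"):
            phrase.pop()
        if phrase:
            yield " ".join(w.lower() for w, _ in phrase)
        phrase = []


# Hand-tagged sample (an assumption); a real tagger would produce these pairs.
tagged = [
    ("The", "DT"), ("search", "NN"), ("engine", "NN"), ("ranks", "VBZ"),
    ("relevant", "JJ"), ("terms", "NNS"), ("and", "CC"), ("the", "DT"),
    ("search", "NN"), ("engine", "NN"), ("is", "VBZ"), ("fast", "JJ"),
    (".", "."),
]
phrase_counts = Counter(noun_phrases(tagged))
```

Sorting `phrase_counts` by count gives a simple frequency ranking of candidate terms; a bare adjective such as "fast" is dropped because it never attaches to a noun.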
If you want to find the words which are "interesting" with respect to a (domain-specific) reference corpus, things get more complicated and this paper and maybe this one should be helpful.
posted by snownoid at 9:22 AM on March 1, 2007
This thread is closed to new comments.
posted by Blazecock Pileon at 2:58 AM on March 1, 2007