Natural Language Processing, relative basis for search.
August 8, 2009 4:05 PM
Natural Language Processing (NLP) Filter. I'm looking for a method (or category of methods) to judge the relevance of a small unit of text, 100-200 characters, against similar units of text.
I have a large set of these textual units, and I'm trying to discover the "relatedness" of my query unit (an item drawn from that set) to every other unit in the set, judged relative to the set as a whole.
In other words, I'm not looking for just an ordered list of rankings of my query applied to every document in the set. Rather, I want that ordered list normalized by the magnitude of my query applied to the set as a whole (perhaps via some averaging function).
Do traditional search engines (open source, like Lucene or Xapian) do this already?
What I mean by relatedness is that we are talking about the same things by some arbitrary, empirical measure. In other words, there is no supervised or unsupervised learning, just some off-the-shelf measure of 'relatedness'.
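For concreteness, a minimal sketch of the kind of normalized ranking being asked for, assuming TF-IDF vectors and cosine similarity via scikit-learn (the question doesn't commit to any of these choices):

```python
# Sketch: rank units against a query unit, normalized by the set-wide mean score.
# TF-IDF + cosine is one plausible reading of "off-the-shelf measure of relatedness".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

units = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply on monday",
    "markets dropped again on tuesday",
]
query_index = 0  # the query unit is drawn from the set itself

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(units)              # one row per unit

scores = cosine_similarity(tfidf[query_index], tfidf).ravel()  # raw relatedness
mean_score = scores.mean()                           # magnitude of the query against the whole set
normalized = scores / mean_score                     # judged relative to the set as a whole

for rank, i in enumerate(normalized.argsort()[::-1], start=1):
    print(rank, round(normalized[i], 3), units[i])
```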
Response by poster: Keeping notes as they come in:
NLTK is a probable platform. Specifically, the KMeansClusterer class.
This is a form of unsupervised learning. It appears to be an iterative process that 'clusters' similar samples via a hill-climbing algorithm. As always, it is the shape of the hills that matters, so the distance functions (metrics) are crucial here.
posted by kuatto at 4:57 PM on August 8, 2009
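A minimal sketch of the KMeansClusterer approach, assuming plain term-count vectors and a fixed cluster count (both choices are assumptions, not from the notes above):

```python
# Sketch: clustering short text units with NLTK's KMeansClusterer.
import numpy
from nltk.cluster import KMeansClusterer
from nltk.cluster.util import cosine_distance

units = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply on monday",
    "markets dropped again on tuesday",
]

# Build simple term-count vectors over a shared vocabulary.
vocab = sorted({word for unit in units for word in unit.split()})
vectors = [
    numpy.array([unit.split().count(word) for word in vocab], dtype=float)
    for unit in units
]

# Two clusters, cosine distance, several random restarts (hill climbing is
# sensitive to the starting point, i.e. "the shape of the hills").
clusterer = KMeansClusterer(2, cosine_distance, repeats=10, avoid_empty_clusters=True)
assignments = clusterer.cluster(vectors, assign_clusters=True)

for unit, label in zip(units, assignments):
    print(label, unit)
```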
So, by relatedness, you mean a measure of similarity? Is it that you want to see if your query contains the same words? Should the words be in the same order?
There is a genre of evaluation metrics used to judge the quality of machine translation output against translations generated by humans, that is, sets of documents translated into the same language. They are all about similarity metrics and might be useful for giving you this kind of ordered ranking. One of the better open-source ones for your purposes might be METEOR.
posted by Alison at 5:39 PM on August 8, 2009
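A minimal sketch of using METEOR as a pairwise similarity score, assuming NLTK's implementation in a recent version where meteor_score expects pre-tokenized token lists:

```python
# Sketch: an MT evaluation metric (METEOR) as a text-to-text similarity score.
# Assumes the WordNet data has been downloaded (nltk.download("wordnet")).
from nltk.translate.meteor_score import meteor_score

reference = "a cat was sitting on a mat".split()
hypothesis = "the cat sat on the mat".split()

# METEOR aligns unigrams by exact match, stem, and WordNet synonymy, so it is
# more forgiving than raw word overlap; the score falls between 0 and 1.
print(meteor_score([reference], hypothesis))
```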
Seems like you want to start with something simple here as a distance metric, such as cosine similarity using TFIDF-weighted vectors. This is pretty standard in the IR world for comparing chunks of text.
I don't recall whether NLTK does this offhand, but it's pretty easy to write in Python; ~100 lines or so.
If you want to get fancy, you can either use NLTK to identify chunks and work with those, or else use something like WordNet to try to identify synonyms. But I'd start at the beginning.
posted by chbrooks at 5:58 PM on August 8, 2009
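A bare-bones sketch of the TF-IDF plus cosine-similarity idea in plain Python, along the lines suggested above; tokenization here is just whitespace splitting, and a fuller version would add stemming and stop-word handling:

```python
# Sketch: TF-IDF weighting and cosine similarity in plain Python.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply on monday",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```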
Response by poster: As is so often the case, the question I started with needs refinement.
chbrooks, I think you're onto it. I'm looking for a way to convert a short snippet of text into a semantically encoded vector, just like TF-IDF, one suitable for clustering and other algorithms.
It looks like TF-IDF may be considered a baseline way to create metrics in a dataset of paragraphs. I just need the distances to match people's perceptions (e.g. "those two paragraphs kinda look the same"). If TF-IDF can do this, I would be very happy!
posted by kuatto at 6:16 PM on August 8, 2009
Best answer: Hmm. TF-IDF doesn't really deal with semantics at all. It just weights terms in a vector by how common they are in that vector and how rare they are in the corpus. Similarity will just be the fraction of shared terms, with rare terms weighted more heavily.
You might also want to look at Latent Semantic Indexing as a way to statistically infer semantic similarity between terms, based on co-occurrence in a corpus.
You can also do k-means clustering (or expectation maximization) with TFIDF-weighted vectors. As you point out, the question is still how to measure similarity. If you're going to use a vector model, some flavor of cosine similarity is a good place to start.
Now, if you want something that people will say is similar (as opposed to a statistical or IR-style shared-terms measure of similarity), I think you have two choices:
- Use a thesaurus such as WordNet to match related terms.
- Get humans to label some data for you and learn from that.
posted by chbrooks at 6:33 PM on August 8, 2009
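A rough sketch of the Latent Semantic Indexing idea mentioned above, assuming scikit-learn's TruncatedSVD over TF-IDF vectors (the library choice and the number of components are assumptions):

```python
# Sketch: Latent Semantic Indexing/Analysis over TF-IDF vectors, then cosine
# similarity in the reduced space, so terms that co-occur across the corpus
# end up pointing in similar directions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

units = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply on monday",
    "markets dropped again on tuesday",
]

tfidf = TfidfVectorizer().fit_transform(units)

# Project into a small number of latent "topic" dimensions.
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

print(cosine_similarity(lsi[:1], lsi))   # query unit 0 against the whole set
```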
This thread is closed to new comments.