# What do you call similarity calculation?

September 14, 2010 5:13 PM

What do you call the science/math for calculating similarity between people based on survey results (like OK Cupid)?

I'm thinking along the lines of match-making, where the result is identifying a number of people who would be compatible with another person, based on a questionnaire.
posted by lrivers to Computers & Internet (12 answers total) 3 users marked this as a favorite

Statistics?
posted by paulg at 5:22 PM on September 14, 2010

Correlation.
posted by Wet Spot at 5:36 PM on September 14, 2010

Using a set of rules to show where members of a group do and don't overlap is an application of set theory.
posted by Babblesort at 5:37 PM on September 14, 2010

Maybe you could consider it as a specific type of collaborative filtering?
posted by bassooner at 5:40 PM on September 14, 2010

Statistical Matching

Consumer Modeling

Statistical Classification or Clustering
posted by gus at 6:41 PM on September 14, 2010

A generic name for the process that they're probably using is "dimension reduction."

What they have is a whole damn bunch of answers to questions. What they want to know is how similar two people are to each other.

The catch is that there are so many questions that figuring similarity on the basis of all the questions is tricky or computationally expensive. Basically, each question is a different dimension, and you're trying to locate how far apart two points are in a 100-dimensional hyperspace.
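To make the "how far apart are two points" idea concrete, here is a minimal sketch that treats each person's answers as a vector and computes the ordinary Euclidean distance between them. The function name, the 1-5 answer scale, and the sample answers are all made up for illustration; this is not a claim about what OkCupid actually does.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length answer vectors."""
    if len(a) != len(b):
        raise ValueError("answer vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical answers on a 1-5 agreement scale, one entry per question.
alice = [5, 1, 4, 2]
bob = [4, 1, 5, 2]
carol = [1, 5, 1, 5]

print(euclidean_distance(alice, bob))    # smaller: similar answers
print(euclidean_distance(alice, carol))  # larger: dissimilar answers
```

With 100 questions the vectors just get longer; the formula is the same, which is why the number of dimensions becomes the thing you want to cut down.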

So, what you can use is any number of techniques that simplify the hyperspace and reduce the number of dimensions.

I don't know what they might do, but in other circumstances you could look at the answers a person gives to a series of political questions -- say 50 dimensions -- and try to reduce that to one or two dimensions of ideology; how liberal or conservative the person is.

One thing is that most of these techniques don't involve taking a bunch of questions and asking "HOW LIBERAL?!!?" Instead, what they do is look for underlying dimensions on the basis of how well the dimensions predict the answers. The techniques generally look for the fewest dimensions that predict the most answers.

Particular techniques include factor analysis, principal components analysis, item-response or Rasch models, and (in other contexts) ideal point estimation software.
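As a rough illustration of the principal-components idea, the sketch below reduces four made-up survey questions to a single score per respondent. It finds the first principal component by power iteration on the covariance matrix, which is just one simple way to do it; a real analysis would normally use a statistics package. The function name, the data, and the 1-5 scale are assumptions for the example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows: list of equal-length numeric answer vectors, one per respondent.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix between questions (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: multiply by cov and renormalise until it settles.
    # The start vector is arbitrary; it only fails if it happens to be
    # exactly orthogonal to the dominant direction.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    # Project each respondent onto that single direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical data: questions 1-2 and questions 3-4 pull in opposite directions.
answers = [
    [5, 4, 1, 2],
    [4, 5, 2, 1],
    [1, 2, 5, 4],
    [2, 1, 4, 5],
]
direction, scores = first_principal_component(answers)
print(direction)
print(scores)
```

The first two respondents land on one side of the single axis and the last two on the other. The sign of the axis is arbitrary, so which side counts as "positive" carries no meaning.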

I would bet a cheap lunch that... crap. That one that won't take gays, and where there's the creepy grandpa guy who runs it. Creepy grandpa guy talks about being matched on umpty dimensions, which implies to me that they're doing principal-components or some other dimension reduction method.
posted by ROU_Xenophobe at 6:50 PM on September 14, 2010 [4 favorites]

It's not matching. Matching is (shorthand) a way to try to extract a pseudo-experimental sample from observational data.
posted by ROU_Xenophobe at 6:53 PM on September 14, 2010

The algorithm is explained on the okcupid website:

http://www.okcupid.com/faaaq

Their algorithm is fairly unusual, and Wikipedia notes there is a patent pending. For various reasons, it is not really a straightforward application of correlation, clustering analysis, or classification, nor do they mention any need for dimension reduction techniques.
posted by JumpW at 7:12 PM on September 14, 2010

Although, more broadly (not just Okcupid), I think you could make a case for most of the answers above for the general task of "calculating similarity between people based on survey results" depending on the details of your application.
posted by JumpW at 7:20 PM on September 14, 2010

It would be very helpful to know if you mean how OKcupid matches, in general, or are you referring to something like this (the OKcupid blog), where they break down entire groups and compare commonalities...?
posted by 2oh1 at 9:30 PM on September 14, 2010

I think they're talking about how, for every person, OKCupid displays something like this:

* 90% Match
* 92% Friend
* 3% Enemy
posted by smackfu at 7:03 AM on September 15, 2010

In general, such a method is called a similarity measure. One of the simplest, cosine similarity, computes the cosine of the angle between two n-dimensional vectors. Many such measures are implemented in the SimMetrics library.
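A minimal sketch of cosine similarity, plus a helper that ranks other people by how similar they are to one person. It is written in plain Python rather than with SimMetrics; the function names, the answer coding (-2 for strongly disagree up to +2 for strongly agree), and the sample data are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)

def rank_by_similarity(target, others):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(target, vec)) for name, vec in others.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical answers coded from -2 (strongly disagree) to +2 (strongly agree).
me = [2, -1, 0, 1]
others = {
    "same_direction": [4, -2, 0, 2],
    "opposite": [-2, 1, 0, -1],
    "mixed": [1, 1, 0, 0],
}
for name, similarity in rank_by_similarity(me, others):
    print(name, round(similarity, 3))
```

Only the direction of the vectors matters, not their length, so someone who gives the same pattern of answers more emphatically still scores as identical. If you subtract each person's mean answer first, the same calculation gives the Pearson correlation mentioned earlier in the thread.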

Classification and clustering are different problems, and clustering relies on a similarity measure. Classification assigns objects to one of a predetermined set of labels. Clustering divides objects into clusters, where the specific nature of each cluster is not predetermined.
posted by James Scott-Brown at 8:08 AM on September 15, 2010
