# What's that algorithm?

November 28, 2005 1:59 PM

StatHeads- Is there a name for the following analysis/process/algorithm?

I have a set of X things, each of which has Y attributes (think of 10 people, each with different hair, eyes, etc). For a given subset of Z things, I want to learn:

- Which attributes Y have in common but X-Y do not.

- Which attributes X-Y have in common, but Y do not.

Using the previous example, I'd want to take 3 people, and find out if, say, they all had green eyes, or that everyone BUT them lived in NY.

I'm sure there's some existent process for sorting this out, but I don't know where to start.
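A brute-force sketch of the problem as stated (the people and attributes below are invented for illustration, not anyone's actual data):

```python
def contrast(population, subset, attributes):
    """For a subset Z of a population X, find (attribute, value) pairs
    shared by every member of Z but by no member of X-Z, and vice versa."""
    rest = [p for p in population if p not in subset]
    z_only, rest_only = [], []
    for attr, value in attributes:
        in_all_z    = all(p.get(attr) == value for p in subset)
        in_no_rest  = all(p.get(attr) != value for p in rest)
        in_all_rest = all(p.get(attr) == value for p in rest)
        in_no_z     = all(p.get(attr) != value for p in subset)
        if in_all_z and in_no_rest:
            z_only.append((attr, value))
        if in_all_rest and in_no_z:
            rest_only.append((attr, value))
    return z_only, rest_only

people = [
    {"eyes": "green", "city": "NY"},
    {"eyes": "green", "city": "Boston"},
    {"eyes": "brown", "city": "NY"},
]
z = people[:2]
print(contrast(people, z, [("eyes", "green"), ("city", "NY")]))
# → ([('eyes', 'green')], [])
```

This is the naive O(N * P) check for one fixed subset; the thread below discusses why enumerating all subsets this way blows up.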


*"Which attributes Y have in common but X-Y do not.*

*Which attributes X-Y have in common, but Y do not."*

This doesn't make sense. Y is the number of attributes and X is the number of things, so X-Y is a dimensional mismatch.

posted by Gyan at 2:10 PM on November 28, 2005

Sorry, it should read:

- Which attributes Z have in common but X-Z do not.

- Which attributes X-Z have in common, but Z do not.

posted by mkultra at 2:13 PM on November 28, 2005


Do you mean like when WalMart noticed that they sell a lot of beer in the buy-up right before a hurricane or snowstorm, so stores had better stock up then? I think that's normally just called data mining. I imagine the hard part of the endeavor is sorting through a big pile of statistical artifacts to find the real, nonspurious relationships. How they do this I dunno.

Really the best answer is to have a theory about what Z should have in common but X-Z should not, and test that theory with data. When you just up and ask the data to tell you a story, it can tell you many lies.

posted by ROU_Xenophobe at 2:17 PM on November 28, 2005


Building back on that, the good use for inductive learning (ie, asking the data to tell you a story like you describe) is that you can more easily build a relevant theory that you can then test on new data, or test other implications of.

posted by ROU_Xenophobe at 2:20 PM on November 28, 2005


so you're dividing a set into two and want to know what properties are both common to all members of one "half", and absent from all members of the other "half"?

it strikes me that this is a backwards way of looking at things. it's equivalent to dividing the set into two by each particular property, recording the different sets that you form, and then using those results to construct what you described.

is the motivation that you have many more properties than members?

anyway - don't know. sounds like the kind of thing a biologist (or suitable mathematician) might, though.

posted by andrew cooke at 2:23 PM on November 28, 2005


Sounds like the sort of thing Prolog is good at. Prolog is a computer language for logic programming, or more precisely, predicate calculus.

One offshoot of these ideas that is widely used is inference engines, or rule engines. An inference engine not only figures out whether a particular rule is true or false, but also figures out how to organize multiple rules to check them all efficiently.

posted by gus at 2:26 PM on November 28, 2005


It almost sounds like you are trying to do cluster analysis - like k-means clustering. ANOVA in reverse.

posted by milkrate at 2:28 PM on November 28, 2005


A point of clarification: By "a set of properties that elements of Z have in common but which elements of X-Z do not", do you mean that _no_ member of X-Z has this property, or simply that it is not the case that _all_ members of X-Z have this property but it is the case that all members of Z do?

i.e. Are you looking for:

y such that ForAll z in Z y(z) AND ForAll x in X-Z !y(x)

or are you looking for

y such that ForAll z in Z y(z) AND ThereExists x in X-Z such that !y(x)

posted by lucasks at 2:39 PM on November 28, 2005
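The two quantifier readings above give different answers on the same data. A minimal sketch of the distinction (predicates and people are made up for illustration):

```python
def strict(Z, rest, y):
    # all of Z satisfy y, and NO member of X-Z does
    return all(map(y, Z)) and not any(map(y, rest))

def loose(Z, rest, y):
    # all of Z satisfy y, and at least one member of X-Z does not
    return all(map(y, Z)) and any(not y(x) for x in rest)

green = lambda p: p["eyes"] == "green"
Z    = [{"eyes": "green"}, {"eyes": "green"}]
rest = [{"eyes": "green"}, {"eyes": "brown"}]

print(strict(Z, rest, green))  # False: one outsider also has green eyes
print(loose(Z, rest, green))   # True: at least one outsider does not
```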


If I am understanding you correctly, you're talking about the non-intersection of a set Z and its *relative complement*, X-Z.

So, say you have your three people, and you have two attributes, eye color and living in NY, that define subsets A and B respectively.

The union of these subsets is all those who are either green-eyed, New Yorkers, or both.

The intersection of these subsets is all green-eyed New Yorkers.

The relative complement A\B is all those who are green-eyed but do not live in New York.

The relative complement B\A is all those who live in NY but do not have green eyes.

posted by nekton at 2:42 PM on November 28, 2005
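The set operations nekton describes map directly onto Python's built-in set type (the names below are invented for illustration):

```python
# Subset A: green-eyed people; subset B: people living in NY.
green_eyed  = {"alice", "bob", "carol"}
new_yorkers = {"bob", "carol", "dave"}

print(green_eyed | new_yorkers)  # union: green-eyed, New Yorkers, or both
print(green_eyed & new_yorkers)  # intersection: green-eyed New Yorkers
print(green_eyed - new_yorkers)  # relative complement A\B: green-eyed, not in NY
print(new_yorkers - green_eyed)  # relative complement B\A: in NY, not green-eyed
```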


ROU_X/andrew: I have my reasons for wanting to do it this way. I can't talk about the rest.

Thanks for the info- ANOVA and discriminant analysis look like good roads to head down.

Is this the kind of thing I'd expect a master's in comp sci to grok, or do I need to go to the straight up mathematics folks?

posted by mkultra at 2:43 PM on November 28, 2005


just to clarify this "reverse" thing that milkrate and i have mentioned (i think we're kind-of talking about the same thing). if you have N members and P properties, then the algorithm you describe means:

for each of 2^N different groups, check each property of each member to see which meet the all/none criteria. that's about 2^N * N * P operations.

in contrast, if you divide by each property in turn and record what groups you get, it's P * N operations. which is much much faster.

that's if you want to look at all groupings. but even if you only want to look at one combination, your way still costs N * P, which is the same price as finding all the groups if you do it the "right" way round.

(i may have ignored something significant in the cost of indexing and collating results, but it's not going to be much smaller than 2^N, which is exponential - really bad).

posted by andrew cooke at 2:43 PM on November 28, 2005
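The "right way round" partitioning andrew cooke describes can be sketched like this (a minimal illustration; the predicates and people are made up): one pass per property records which members have it, roughly P * N checks in total, and any subset comparison can then be answered from the recorded groups.

```python
from collections import defaultdict

def partition_by_property(population, properties):
    """For each property, record the indices of members that have it.
    Cost is about P * N predicate checks, versus roughly 2^N * N * P
    if you instead tested every possible subset against every property."""
    groups = defaultdict(set)
    for i, member in enumerate(population):
        for prop in properties:
            if prop(member):
                groups[prop.__name__].add(i)
    return dict(groups)

def green(p): return p["eyes"] == "green"
def ny(p):    return p["city"] == "NY"

people = [
    {"eyes": "green", "city": "NY"},
    {"eyes": "brown", "city": "NY"},
    {"eyes": "green", "city": "LA"},
]
print(partition_by_property(people, [green, ny]))
# e.g. {'green': {0, 2}, 'ny': {0, 1}}
```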


oh, sorry. missed that on preview.

posted by andrew cooke at 2:44 PM on November 28, 2005


lucasks- Ultimately, I'd like the latter, and a way to know the *degree* to which it applies.

nekton- Yeah, I guess you could ultimately break it down to that, but we're talkin' 'bout a whole lotta subsets...

posted by mkultra at 2:47 PM on November 28, 2005


Yeah, it sounds like you want discriminant analysis which is a way of talking about similarities of populations. milkrate is right too: you should also look into hierarchical cluster analysis, a way of building similarity taxonomies from your data.

posted by bonehead at 3:50 PM on November 28, 2005


I'm a bit late here, but:

This does sound like a case for cluster analysis; these groups tend to segregate together. Discriminant analysis is one subset of cluster analysis, in the sense that it tells you what combination of parameters (eye color, new-yorkiness, etc.) best discriminate between your pre-defined classes of people.

If you're using quantifiable characteristics, I might also suggest a leverage analysis. In this, you essentially evaluate each variable on its own merits, while holding the effect of all other variables constant. This works best, of course, when you are using many observed variables (eye color, location, and body odor of various people) to predict one final variable (the amount of money they're willing to lend mkultra). Then you can look at the effect of a single variable on the result you're looking for.

The first thing to do, though, is to figure out the dimensionality of your data set. For that, Principal Components analysis works well if you have many observations; discriminant analysis can then be applied to the observations in this reduced data set. If you're not *totally* swamped by the number of observations per individual, ANOVA can work well.

Do note this qualification of the ANOVA method for data mining (and inference in general):

*Specifically, one can ask whether or not two or more groups are significantly different from each other with respect to the mean of a particular variable. ... However, it should be clear that, if the means for a variable are significantly different in different groups, then we can say that this variable discriminates between the groups.* (via)

To really be able to make predictions from your analysis, you'll need to do one of the following:

a) post-hoc tests such as Bonferroni corrections.

b) leverage tests to look at the effect of each variable ALONE.

c) hope.

The last stage of this kind of analysis is the hardest, at least in my experience. Often I lose a bit of faith in my results by doing the statistics to their natural conclusion.

All of which is really just musing. One final point, though:

There's a crucial distinction to be made between clustering or categorizing data, and predicting from those categorizations. If you want to be able to maximize your ability to predict a single response variable (willingness to lend to mk) from your multivariate observations, you really should do a different kind of analysis. Discriminant analysis can get you close, if your answer is readily classifiable into "clusters" of response. But if it's not, you'll need to use more powerful tools in regression and linear approximation. Let me know if I should talk about that.

<shill>Let me know if you want a consultant.</shill>

posted by metaculpa at 4:12 PM on November 28, 2005
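The post-hoc correction in option (a) above is simple arithmetic; here's a minimal sketch of a Bonferroni correction (the p-values are invented for illustration):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: with m tests, reject the null only where
    p < alpha / m, which keeps the family-wise error rate at or below alpha."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Five tests, so the corrected threshold is 0.05 / 5 = 0.01;
# only the first p-value survives it.
print(bonferroni([0.004, 0.02, 0.03, 0.2, 0.5]))
# → [True, False, False, False, False]
```

The cost of this conservatism is exactly the loss of power ikkyu2 warns about further down the thread.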


Just to add: validity in this analysis really is hugely dependent on how you're creating subset Z. If it's a random sampling, you simply need to do some calculations to ensure that your power is adequate to form a statistically valid conclusion; your power will be based on the population size and the number of analyses you are intending to perform.

If, however, you are selecting Z from the members of X in any way that is nonrandom, you need to do more work to ensure that your selection process isn't somehow introducing bias into the distribution of the attributes in Y which you are analyzing. These techniques reduce power considerably, requiring the population size to be larger than the above case where Z is selected randomly from X.

In certain sorts of medical research, Cox proportional hazards ratios are used to deal with the latter issues. But - and I cannot emphasize this highly enough - you have to design your methods **from the very beginning** with the final method of analysis in mind. If you do not do this - if you acquire a data set in some haphazard fashion without a clear idea of what you want to do with it - you **will** commit validity-destroying errors.

I'm assuming, of course, that you actually care about obtaining valid results - that the results matter - and you are not just intending to go through the motions of statistical analyses for reasons of showmanship or idle curiosity. This is a tricky business. Consult a statistician.

posted by ikkyu2 at 12:44 PM on November 29, 2005


