# corrcoeff

April 20, 2010 8:09 AM Subscribe

How can I compare a set of measurements? I'm using MATLAB, but general statistics advice would be helpful.

I have a set of 10 measurements from one test subject stored as a vector, and I would like to compare them against a reference vector (with a corresponding 10 measurements), with some idea of how close the test subject is to the reference. The elements of the vector represent anthropometric data: element 1 might be height, element 2 might be shoulder width, and so on. I want to test how close my test person's data is to the reference person's.

My ideas so far:

One way would be to loop through each test element in turn and find the % difference from the reference element. I would then have 10 percentages, which I suppose I could average to get an idea of how closely the two vectors match.

This seems rather clumsy and inaccurate (I don't have much statistics experience).

MATLAB:

In MATLAB, I had been looking at the corrcoeff() function, but I suspect it may not be right for comparing data like this (is it meant for a series of data, maybe?).

If someone can provide general guidance on a method to do this, or (even better) if someone can suggest a MATLAB function or functions which are more useful for this, I'd be really grateful. Thanks.


If you just want to see how two vectors compare in a single number (ignoring the statistics), you'd probably want (A dot REF) / (REF dot REF).

posted by JMOZ at 8:28 AM on April 20, 2010
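For anyone following along outside MATLAB, JMOZ's one-number comparison can be sketched in Python; the measurement vectors below are invented for illustration:

```python
# Invented measurement vectors: height (cm), shoulder width (cm), weight (kg).
a = [180.0, 45.0, 75.0]    # test subject
ref = [175.0, 44.0, 70.0]  # reference subject, same element order

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# JMOZ's single number: (A . REF) / (REF . REF).
# A value of 1.0 means the test vector projects exactly onto the reference.
ratio = dot(a, ref) / dot(ref, ref)
print(round(ratio, 4))
```

Note that this ratio is scale-sensitive: an element recorded in millimetres will dominate one recorded in metres, so some normalization first may be in order.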


I'm not sure what you want, but I use corrcoeff a lot.

I have X number of subjects who run in Y number of experiments. Each experiment yields Z measures of performance, so I have an X-by-Y-by-Z matrix of results. I usually average across Y, so I only have X by Z. I then run corrcoeff, which yields results like the following: "Oh, when people have a high (measure A) they tend to have a low (measure B), a significant inverse correlation! And (measures C & D) correlate positively to p < .05!"

(Ha! If only...)

So yeah, that doesn't sound like what you want. It really depends on what question you want to answer. If you just want to get some value that represents the differences between the two vectors, your method will work fine. (You could also just subtract them, take the absolute value, and average the resulting vector. Without a loop it's much faster.) If you want to see if they're *significantly* different, well, it depends on a lot lot lot more.

posted by supercres at 8:41 AM on April 20, 2010
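The loop-free subtract/absolute-value/average idea looks like this in Python (numbers made up for illustration):

```python
# Invented reference and test vectors, same element order.
reference = [175.0, 44.0, 70.0]
subject = [180.0, 45.0, 75.0]

# Subtract, take absolute values, average -- no explicit loop over indices.
diffs = [abs(s - r) for s, r in zip(subject, reference)]
mean_abs_diff = sum(diffs) / len(diffs)
print(mean_abs_diff)
```

Since the elements are in different units (cm, kg), dividing each difference by its reference element first, as the question proposes, makes the average more meaningful.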

You need to operationalize what you mean by "close". Your idea about % difference is one such operational definition, but probably not the best one. A common assumption in statistics is that measurements of the kind you're talking about are generally normally distributed around the mean; if the reference vector you mention is supposed to reflect the mean height/shoulder width/whatever of the population, then you would expect approximately half of the test subjects to fall above it and half below.

Corrcoeff is not what you want here, at least not on the raw measurement data. It might be okay if you normalized the measurements first.

posted by logicpunk at 8:41 AM on April 20, 2010
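One way to normalize first, as logicpunk suggests, is to express each element in standard-deviation units. A sketch in Python; the population means and SDs below are invented placeholders, not real anthropometric values:

```python
# Invented population means and SDs, one per measurement; real values would
# come from published anthropometric tables.
pop_mean = [170.0, 42.0, 70.0]
pop_sd = [10.0, 3.0, 12.0]
subject = [180.0, 45.0, 75.0]

# Express each element as standard deviations from the population mean, so
# no single raw unit dominates a later correlation or distance.
z = [(x - m) / s for x, m, s in zip(subject, pop_mean, pop_sd)]
print([round(v, 3) for v in z])
```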


Oh, I may have misread your goal here. You are not trying to compare your *group* of 10 subjects, but are looking to see how each of these 10 compares to the reference group? In that case your sample size is limiting, but really, it's the sample size of the vectors from the reference group that's the problem. The normal assumption is usually reasonable for most anthropometric measures. So I think the best you can do is to create a t-distribution for the reference group, and then calculate a percentile based on standard deviations from the mean for each of the elements of the test vectors. Essentially that would give you 10 new vectors in terms of standard deviations as opposed to absolute values, which would be a way of standardizing the relative difference of each of the elements.

I'm too rusty on my MatLab to tell you how that would work in practice, but it's probably a reasonable first step after actually looking at the data.

posted by drpynchon at 8:42 AM on April 20, 2010
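A rough sketch of the percentile idea in Python. (It uses the standard library's normal distribution rather than the t-distribution drpynchon suggests, since the stdlib has no t CDF; the mean and SD are invented.)

```python
from statistics import NormalDist

# Invented reference mean and SD for one measurement (height in cm).
# For a small reference sample, a t-distribution would be more appropriate;
# the normal here is a stand-in.
height = NormalDist(mu=170.0, sigma=10.0)

# A subject one SD above the mean sits near the 84th percentile.
percentile = height.cdf(180.0)
print(round(percentile, 3))
```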


1) The answer is 100% context-specific. % difference on an arbitrary initial scale does not mean anything.

2) Rather than just a reference vector, you really need something more like a reference population. If you could compare percentiles in a reference population, that would be somewhat meaningful, provided you had the same amount of error across measurements, which you probably don't. It still doesn't get at the importance of the measurements: for example, two people identical except for the length of their appendix are more similar for most purposes than two people identical except for their {something important to you}.

posted by a robot made out of meat at 9:01 AM on April 20, 2010


*One way would be loop through each test element in turn, and find the % difference from the reference element. I would then have 10 percentages, which I suppose I could average to get an idea of how closely the two vectors match.*

for what it's worth, in matlab you don't need a loop to do this:

```matlab
reference = [1 2 3 4 5];
sample = [6 7 8 9 0];
percents = (sample - reference) ./ reference;
```

the dot in the division isn't a typo. here, the variable percents is a new vector populated by the percent differences. norm(percents) will tell you the magnitude of that 5-dimensional vector, which i suppose is a plausible way of comparing things. but there are lots of ways to compare things. i agree that the answer depends on what it is you want to do.

posted by sergeant sandwich at 9:44 AM on April 20, 2010
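The same computation translated to Python, for anyone without MATLAB handy (same numbers as sergeant sandwich's snippet):

```python
from math import sqrt

# Same invented numbers as the MATLAB snippet above.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
sample = [6.0, 7.0, 8.0, 9.0, 0.0]

# Elementwise percent differences (what ./ does in MATLAB).
percents = [(s - r) / r for s, r in zip(sample, reference)]

# Equivalent of MATLAB's norm(percents): Euclidean length of the vector.
magnitude = sqrt(sum(p * p for p in percents))
print(round(magnitude, 4))
```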

Maybe I'm misreading the question, but it seems to me that the asker has ten dimensions of difference and doesn't want ten dimensions of difference.

If you have lots of subjects, a common answer here is to use principal components or other data-reduction techniques to, well, reduce the data. Then he or she might have only a couple of (highly abstract) dimensions along which to compare each observation to the reference case.

posted by ROU_Xenophobe at 9:45 AM on April 20, 2010
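As a tiny sketch of what PCA's data reduction buys you, here is a two-measurement toy example in pure Python (closed-form eigenvalue for a 2x2 covariance matrix; real work would use MATLAB's princomp or an equivalent library, and the data here are invented):

```python
from math import sqrt

# Toy data: each row is one subject, columns are two hypothetical
# measurements that happen to vary together.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]

n = len(data)
means = [sum(row[j] for row in data) / n for j in range(2)]
centered = [[row[j] - means[j] for j in range(2)] for row in data]

# Entries of the 2x2 sample covariance matrix.
cxx = sum(r[0] * r[0] for r in centered) / (n - 1)
cyy = sum(r[1] * r[1] for r in centered) / (n - 1)
cxy = sum(r[0] * r[1] for r in centered) / (n - 1)

# Leading eigenvalue of [[cxx, cxy], [cxy, cyy]] (closed form for 2x2).
lam = (cxx + cyy + sqrt((cxx - cyy) ** 2 + 4 * cxy * cxy)) / 2

# Fraction of total variance captured by the first principal component;
# near 1.0 means one abstract dimension summarizes both measurements.
explained = lam / (cxx + cyy)
print(round(explained, 4))
```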


I don't think you will be able to make any "statistical" inferences here (e.g. of the type 'this person is statistically different than the reference condition') unless you know what the population variance is. For example, if your subject is 10% taller than the reference condition, you can't make any statements about the probability of such a measurement occurring without knowing the variability of human height overall.

I think %difference, as you've mentioned, is the best description you can have here.

Correlation would only be appropriate on a decently large sample (ie n > 10 or so, hopefully n>30, where each observation is independent) on two variables (ie the heights of 30 people vs. the weights of 30 people).

posted by JumpW at 9:46 AM on April 20, 2010
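For reference, the kind of across-subjects correlation JumpW means, computed by hand on invented heights and weights (far fewer subjects than you'd want for inference):

```python
from math import sqrt

# Invented heights (cm) and weights (kg) for a handful of subjects.
heights = [160.0, 165.0, 170.0, 175.0, 180.0]
weights = [55.0, 60.0, 65.0, 70.0, 75.0]

n = len(heights)
mh = sum(heights) / n
mw = sum(weights) / n

# Pearson r: co-deviation divided by the product of deviation magnitudes.
cov = sum((h - mh) * (w - mw) for h, w in zip(heights, weights))
sh = sqrt(sum((h - mh) ** 2 for h in heights))
sw = sqrt(sum((w - mw) ** 2 for w in weights))
r = cov / (sh * sw)
print(round(r, 4))
```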


Problem: there are an uncountably infinite number of metrics that could define how "close" two vectors are. You need to decide how you want to define "close". This may well be driven by the nature of the data: if you have population or large sample measures of the variance of each element, and/or covariance of each pair of elements, then you may want to use that information to define "close". For example, a difference in height of half an inch might matter more to you if the individuals are closer to the mean (or reference) value for height, and a small difference in height (from the reference) might matter less when the weight is closer to the reference value. But that's something that you have to decide up front - what do you want your measure of "close" to capture?

posted by dilettanti at 12:48 PM on April 20, 2010
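One concrete way to fold variance information into a definition of "close" is a variance-weighted (diagonal Mahalanobis-style) distance. A Python sketch, with invented vectors and variances:

```python
from math import sqrt

# Invented vectors and per-element population variances; dividing by the
# variance makes "one unit of difference" comparable across elements.
reference = [175.0, 44.0, 70.0]
subject = [180.0, 45.0, 75.0]
variances = [100.0, 9.0, 144.0]

# One of the infinitely many possible metrics, per dilettanti's point.
d = sqrt(sum((s - r) ** 2 / v
             for s, r, v in zip(subject, reference, variances)))
print(round(d, 4))
```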


This sounds like the sort of multivariate analysis that is used in anthropology. Like if you had a partial hominid specimen (like a jaw bone) and wanted to determine, in an objective fashion, whether it was *Homo erectus* or *Homo sapiens* (or whether it's too close to call). I have no idea how this is achieved, but I did find a paper that kind of illustrates what I'm getting at here. In it, the authors mention the RDV (randomization of 'distinctness values') test, "*a non-parametric probabilistic approach that assesses whether the a priori fossil groups are random with respect to mandibular morphology. The RDV test assesses the cohesiveness of a group of individuals by calculating a 'distinctness value' (DV) defined by Sokal & Rohlf (1995, p. 806) as '... a measure of homogeneity or cohesion of the members of a group relative to their similarity with other groups'. The DV for a given group is calculated as the average correlation within the group (i.e. average of all pairwise within-group correlation coefficients) minus the average correlation between groups (i.e. average of all possible correlation coefficients between members and non-members). Hence, high positive DVs indicate that the chosen specimens form a distinct group (relative to the other specimens) in which members are more similar in shape to each other than they are to non-members. (Note that the use of correlation coefficients implicitly adjusts for scale, although it does not control for size-related shape variation.) Negative values indicate that members of the chosen group are generally more similar in shape to outside members than they are to one another.*"

Don't know how much this helps.

posted by kisch mokusch at 3:39 AM on April 21, 2010
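The distinctness value described in that quote is straightforward to compute directly. A Python sketch with two invented "shape" groups (the vectors are made up so that within-group correlations are high and between-group correlations low):

```python
from math import sqrt
from itertools import combinations, product

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Invented groups: within each group the elements rise or fall together.
group_a = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 3.1, 4.2]]
group_b = [[4.0, 3.0, 2.0, 1.0], [4.2, 2.9, 2.1, 1.0]]

# DV for group A, per the quoted definition: average within-group
# correlation minus average between-group correlation.
within = [pearson(u, v) for u, v in combinations(group_a, 2)]
between = [pearson(u, v) for u, v in product(group_a, group_b)]
dv = sum(within) / len(within) - sum(between) / len(between)
print(dv > 0)  # a positive DV means group A is distinct from group B
```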


The problem here is that you are going to be on rather thin statistical ice, as it were, if you only have a sample size of 10 per group and want to perform multiple statistical comparisons. One first step might be to consider an omnibus test for difference before bothering to look at the specific elements. This would be something like Tukey's range test. However, your data set may not even meet the assumptions of that test to begin with.

posted by drpynchon at 8:23 AM on April 20, 2010