Correlation measurements of unequal vectors?
May 1, 2008 8:07 PM

Can one perform Pearson's correlation or a variant with unequal numbers of rows?

Is there a variation of Pearson's correlation (or another correlation measurement) that I can use for two vectors X and Y, which have unequal numbers of rows?

Likewise, if I have two sets of data (genomic sequences) that can be "centered", is it reasonable to throw away data at the "edges" which are in X but not in Y, if the edge data do not contribute greatly to the mean and variance? I see this option in R, for example, but am curious about the real-world side effects.
posted by Blazecock Pileon to Science & Nature (7 answers total)

I think you need to define more carefully what you are looking for. Pearson's correlation is the covariance of two variables divided by the product of their standard deviations. Covariance is defined in terms of paired observations and has no real meaning outside of that.

I.e., if you're looking for the correlation between height and weight, you'll measure a bunch of individuals' heights and weights and look at the relationship between the two. A weight measurement without the corresponding height measurement, or vice versa, is useless for determining correlations.
posted by bsdfish at 9:45 PM on May 1, 2008
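
The definition above can be sketched in a few lines of Python. This is only an illustration, not code from the thread: the function name `pearson_r` and the height/weight numbers are made up for the example. It shows why the two vectors must have the same length: every term in the covariance sum needs one x and one y from the same case.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for paired observations (hypothetical helper name)."""
    if len(xs) != len(ys):
        # Covariance is a sum over pairs, so unequal lengths are undefined.
        raise ValueError("Pearson's r needs paired observations of equal length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations; the 1/(n-1) factors
    # of covariance and variance cancel in the ratio, so they are omitted.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up paired measurements: one height and one weight per person.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 69, 77, 91]
r = pearson_r(heights, weights)
print(round(r, 3))
```

Calling `pearson_r(heights, weights[:-1])` raises a `ValueError`, which is the programmatic version of "a weight without a height is useless here".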


No, because by definition there can be no "correlation" if data for one of the variables is missing for certain cases. There are a number of ways to estimate the values for missing data, and these are routinely employed in situations like yours.
posted by Crotalus at 10:12 PM on May 1, 2008


Oh, and to piggyback on bsdfish's response: if there is a non-random reason why certain people gave their weight but withheld their height, then the "real-world side effects" would be a spurious relationship between the variables.
posted by Crotalus at 10:14 PM on May 1, 2008
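
A toy illustration of that point, as a sketch under invented assumptions: the four cases below are made up so that the two variables are exactly uncorrelated, and the "withholding" rule (tall but light respondents do not report) is hypothetical. Dropping the one systematically missing case is enough to produce a correlation that is not in the full data.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for equal-length paired data (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented full data: (height_code, weight_code), 0 = low, 1 = high.
# All four combinations occur once, so the true correlation is zero.
full = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hypothetical non-random missingness: tall-but-light people withhold
# their data, so the case (1, 0) never reaches the analyst.
observed = [(h, w) for h, w in full if not (h == 1 and w == 0)]

r_full = pearson_r([h for h, _ in full], [w for _, w in full])
r_observed = pearson_r([h for h, _ in observed], [w for _, w in observed])
print(r_full, r_observed)
```

With these numbers the full data give r = 0, while the three remaining cases give r = 0.5, purely because of which case went missing.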


Response by poster: "There are a number of ways to estimate the values for missing data, and these are routinely employed in situations like yours."

What guides the decision to estimate or to truncate where null values exist in pairs? R's default is to omit ("truncate").
posted by Blazecock Pileon at 10:19 PM on May 1, 2008


Best answer: "What guides the decision to estimate or to truncate where null values exist in pairs? R's default is to omit ("truncate")."

Your judgement. Face validity. Do you have any reason to believe that there is a systematic reason why certain cases have missing data? My decision-making process in your case would probably run along these lines: if I think an external reviewer is more likely to tank my article because of too many dropped cases than because I estimated missing data, then I'll estimate. Otherwise I won't. How's that for "real world"?
posted by Crotalus at 10:31 PM on May 1, 2008 [1 favorite]
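
To make the two options concrete, here is a small Python sketch comparing them on made-up data. The helper names (`drop_incomplete`, `mean_impute`) are invented for this example, and mean imputation is used only because it is the simplest estimator to show; it is not necessarily what anyone in the thread had in mind, and it tends to pull correlations toward zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for equal-length paired data (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def drop_incomplete(pairs):
    """Option 1: omit any case where either value is missing (None)."""
    return [(x, y) for x, y in pairs if x is not None and y is not None]

def mean_impute(pairs):
    """Option 2: replace each missing value with that variable's observed mean."""
    xs_seen = [x for x, _ in pairs if x is not None]
    ys_seen = [y for _, y in pairs if y is not None]
    mean_x = sum(xs_seen) / len(xs_seen)
    mean_y = sum(ys_seen) / len(ys_seen)
    return [(mean_x if x is None else x, mean_y if y is None else y)
            for x, y in pairs]

def r_of(pairs):
    """Unzip a list of (x, y) pairs and compute r."""
    return pearson_r([x for x, _ in pairs], [y for _, y in pairs])

# Made-up data: three complete cases plus two cases with one value missing.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, None), (None, 10.0)]

r_dropped = r_of(drop_incomplete(pairs))
r_imputed = r_of(mean_impute(pairs))
print(round(r_dropped, 3), round(r_imputed, 3))
```

On these numbers, omitting the incomplete cases gives r = 1.0 from three cases, while mean imputation gives roughly 0.47 from five. Neither is "right"; which one is defensible depends on why the values are missing, which is the judgement call described above.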


Response by poster: That's about as real world as it gets. Thanks for the advice.
posted by Blazecock Pileon at 10:42 PM on May 1, 2008


If you have just two variables, I'd be hard-pressed to find a good reason to use imputation to estimate a value for one of them.
posted by a robot made out of meat at 4:22 AM on May 2, 2008

