# Correlation measurements of unequal vectors?

May 1, 2008 8:07 PM

Can one perform Pearson's correlation or a variant with unequal numbers of rows?

Is there a variation of Pearson's correlation (or another correlation measurement) that I can use for two vectors X and Y, which have unequal numbers of rows?

Likewise, if I have two sets of data (genomic sequences) that can be "centered", is it reasonable to throw away data at the "edges" which are in X but not in Y, if the edge data do not contribute greatly to the mean and variance? I see this option in R, for example, but am curious about the real-world side effects.

No, because by definition there can be no "correlation" if data for one of the variables is missing for certain cases. There are a number of ways to estimate the values for missing data, and these are routinely employed in situations like yours.

posted by Crotalus at 10:12 PM on May 1, 2008

Oh, and to piggyback on bsdfish's response, if there is a non-random reason why certain people gave their weight but withheld their height, then the "real world side effects" would be a spurious relationship between the variables.

posted by Crotalus at 10:14 PM on May 1, 2008
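
A rough simulation of the point above, not something posted in the thread: it compares dropping cases at random with dropping them for a systematic reason. The numbers (5000 people, the 70 kg cutoff, the height/weight relationship) and the helper name `pearson_r` are assumptions made up for illustration.

```python
import math
import random

def pearson_r(pairs):
    """Pearson's r over a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: weight depends on height plus noise.
people = []
for _ in range(5000):
    height = random.gauss(170, 10)
    weight = 70 + 0.9 * (height - 170) + random.gauss(0, 9)
    people.append((height, weight))

# Random withholding: each person drops out with probability 0.5,
# regardless of their height or weight.
random_drop = [p for p in people if random.random() < 0.5]

# Systematic withholding: everyone above 70 kg withholds their height,
# so only the lighter cases survive as complete pairs.
systematic_drop = [p for p in people if p[1] <= 70]

r_full = pearson_r(people)
r_random = pearson_r(random_drop)
r_systematic = pearson_r(systematic_drop)

print(f"all cases:          r = {r_full:.3f}")
print(f"random drop:        r = {r_random:.3f}")
print(f"systematic drop:    r = {r_systematic:.3f}")
```

In this setup the random drop leaves the estimate close to the full-data value, while the systematic drop restricts the range of weights and pulls the correlation down. Other missingness patterns can distort the estimate in other directions.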

*There are a number of ways to estimate the values for missing data, and these are routinely employed in situations like yours.*

What guides the decision to estimate or to truncate where null values exist in pairs? R's default is to omit ("truncate").

posted by Blazecock Pileon at 10:19 PM on May 1, 2008

*What guides the decision to estimate or to truncate where null values exist in pairs? R's default is to omit ("truncate").*

Your judgement. Face validity. Do you have any reason to believe that there is a systematic reason why certain cases have missing data? My decision-making process in your case would probably be along these lines: if I think an external reviewer is more likely to tank my article because of too many dropped cases than because I estimated missing data, then I'll estimate. Otherwise I won't. How's that for "real world"?

posted by Crotalus at 10:31 PM on May 1, 2008 [1 favorite]

That's about as real world as it gets. Thanks for the advice.

posted by Blazecock Pileon at 10:42 PM on May 1, 2008

If you have just two variables, I'd be hard-pressed for a good reason to use imputation and estimate something for one.

posted by a robot made out of meat at 4:22 AM on May 2, 2008
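
One way to see why imputation buys little with only two variables is to try the simplest estimator, mean imputation, on a toy example. This sketch is not from the thread; the data and the helper name `pearson_r` are invented, and it assumes only `x` has gaps while `y` is fully observed.

```python
import math

def pearson_r(pairs):
    """Pearson's r over a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: None marks a missing x; y has no gaps.
xs = [1.0, 2.0, None, 4.0, None, 6.0, 7.0, None]
ys = [1.2, 1.9, 3.5, 4.1, 0.4, 6.3, 6.8, 9.0]

# Option 1: omit every case with a missing x (complete-case analysis).
complete = [(x, y) for x, y in zip(xs, ys) if x is not None]

# Option 2: fill each missing x with the mean of the observed x values.
mean_x = sum(x for x, _ in complete) / len(complete)
imputed = [(mean_x if x is None else x, y) for x, y in zip(xs, ys)]

r_complete = pearson_r(complete)
r_imputed = pearson_r(imputed)

print(f"complete cases: n = {len(complete)}, r = {r_complete:.3f}")
print(f"mean-imputed:   n = {len(imputed)}, r = {r_imputed:.3f}")
```

Under these assumptions the filled-in cases sit exactly at the mean of `x`, so they add nothing to the covariance but do add to the spread of `y`. The imputed estimate can therefore only move toward zero relative to the complete-case one. More elaborate imputation methods behave differently, but with two variables they have no extra information to draw on.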

I.e., if you're looking for the correlation between height and weight, you'll measure a bunch of individuals' heights and weights, and look at the relationship between the two. If you have a weight measurement without the corresponding height measurement, or vice versa, that's useless for determining correlations.

posted by bsdfish at 9:45 PM on May 1, 2008
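
A minimal sketch of the pairing requirement described above, not code from the thread. Pearson's r is computed over paired observations, so any case missing either value is dropped first. The helper names `complete_pairs` and `pearson_r` and the sample measurements are made up, and the two lists are assumed to be aligned by position.

```python
import math

def complete_pairs(xs, ys):
    """Keep only cases where both values are present (None marks missing).

    zip stops at the shorter list, so trailing values with no partner
    are ignored as well.
    """
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]

def pearson_r(pairs):
    """Pearson's r over a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements for six people; two cases are incomplete.
heights_cm = [160, 172, None, 181, 168, 190]
weights_kg = [55, 70, 82, None, 63, 95]

pairs = complete_pairs(heights_cm, weights_kg)
r = pearson_r(pairs)
print(f"{len(pairs)} complete pairs, r = {r:.3f}")
```

Only the four complete cases enter the calculation. The lone weight and the lone height contribute nothing, which is the sense in which unpaired measurements are useless here.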