Correlation measurements of unequal vectors?
May 1, 2008 8:07 PM

Can one perform Pearson's correlation or a variant with unequal numbers of rows?

Is there a variation of Pearson's correlation (or another correlation measurement) that I can use for two vectors X and Y, which have unequal numbers of rows?

Likewise, if I have two sets of data (genomic sequences) that can be "centered", is it reasonable to throw away data at the "edges" which are in X but not in Y, if the edge data do not contribute greatly to the mean and variance? I see this option in R, for example, but am curious about the real-world side effects.
posted by Blazecock Pileon to science & nature (7 comments total) 1 user marked this as a favorite
I think you need to define more carefully what you are looking for. Pearson's correlation is the covariance of two variables divided by the product of their standard deviations. Covariance is defined in terms of pairs, and doesn't really have a semantic meaning outside of that.

I.e., if you're looking for the correlation between height and weight, you'll measure a bunch of individuals' heights and weights, and look at the relationship between the two. If you have a weight measurement without the corresponding height measurement, or vice versa, that's useless for determining correlations.
posted by bsdfish at 9:45 PM on May 1
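
As a rough illustration of the pairing requirement described above, here is a minimal Python sketch. It assumes missing measurements are marked with None and that the two lists line up position by position; the function names and the height/weight numbers are made up for the example and do not come from the thread.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance of the pairs divided by the product
    of the two standard deviations (the 1/n factors cancel)."""
    if len(xs) != len(ys):
        raise ValueError("Pearson's r needs paired observations of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def complete_pairs(xs, ys):
    """Keep only the positions where both values are present."""
    kept = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [x for x, _ in kept], [y for _, y in kept]

# Hypothetical heights (cm) and weights (kg); None marks a missing measurement.
heights = [160.0, 172.0, None, 181.0, 168.0, 175.0]
weights = [55.0, 70.0, 82.0, 85.0, None, 74.0]

# Only the four complete pairs contribute; the unpaired 82.0 and 168.0 are dropped.
h, w = complete_pairs(heights, weights)
r = pearson_r(h, w)
print(len(h), round(r, 3))
```

Passing two lists of different lengths straight to pearson_r raises an error, which is the code-level version of "covariance is defined in terms of pairs".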


No, because by definition there can be no "correlation" if data for one of the variables is missing for certain cases. There are a number of ways to estimate the values for missing data, and these are routinely employed in situations like yours.
posted by Crotalus at 10:12 PM on May 1


Oh, and to piggyback on bsdfish's response, if there is a non-random reason why certain people gave their weight but withheld their height, then the "real world side effects" would be a spurious relationship between the variables.
posted by Crotalus at 10:14 PM on May 1
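
To make the point about non-random missingness concrete, here is a small simulation sketch in Python. Everything in it is an assumption chosen for illustration: the data are synthetic standard-normal pairs with a true correlation of 0.7, "random" deletion drops each pair with probability 0.5, and "selective" deletion drops every pair whose second value is above zero.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's r computed from paired lists of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate(n=5000, rho=0.7, seed=0):
    """Return r for the full data, after random deletion, and after selective deletion."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # y is built so that corr(x, y) is rho in the population.
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        pairs.append((x, y))

    # Missing at random: each pair is kept with probability 0.5, regardless of its values.
    random_kept = [p for p in pairs if rng.random() < 0.5]
    # Missing for a systematic reason: pairs with y above 0 are never observed.
    selective_kept = [p for p in pairs if p[1] <= 0.0]

    def r_of(ps):
        return pearson_r([p[0] for p in ps], [p[1] for p in ps])

    return r_of(pairs), r_of(random_kept), r_of(selective_kept)

r_full, r_random, r_selective = simulate()
print(round(r_full, 2), round(r_random, 2), round(r_selective, 2))
```

Under these assumptions, random deletion should leave the estimate close to the full-data value, while selective deletion pulls it noticeably lower. Other selection mechanisms can push the estimate in other directions, so this shows only one way the result can be distorted.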


There are a number of ways to estimate the values for missing data, and these are routinely employed in situations like yours.

What guides the decision to estimate or to truncate where null values exist in pairs? R's default is to omit ("truncate").
posted by Blazecock Pileon at 10:19 PM on May 1


What guides the decision to estimate or to truncate where null values exist in pairs? R's default is to omit ("truncate").

Your judgement. Face validity. Do you have any reason to believe that there is a systematic reason why certain cases have missing data? My decision-making process in your case would probably be along these lines: If I think an external reviewer is more likely to tank my article because of too many dropped cases than because I estimated missing data, then I'll estimate. Otherwise I won't. How's that for "real world"?
posted by Crotalus at 10:31 PM on May 1 [1 favorite]


That's about as real world as it gets. Thanks for the advice.
posted by Blazecock Pileon at 10:42 PM on May 1


If you have just two variables, I'd be hard-pressed to find a good reason to use imputation and estimate a value for one of them.
posted by a robot made out of meat at 4:22 AM on May 2
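
One way to see why imputation buys little with only two variables is to compare dropping incomplete pairs against the crudest form of estimation, filling each missing value with the observed mean. The Python sketch below does that on synthetic data. The true correlation of 0.7, the 40% missing rate, and the choice of mean imputation are all assumptions made for the example; regression-based or multiple imputation would behave differently.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's r computed from paired lists of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def compare(n=5000, rho=0.7, missing_rate=0.4, seed=1):
    """Return (r after dropping incomplete pairs, r after mean-imputing x)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # x goes missing completely at random; y is always observed.
        xs.append(None if rng.random() < missing_rate else x)
        ys.append(y)

    # Option 1: omit every pair with a missing x.
    obs = [(x, y) for x, y in zip(xs, ys) if x is not None]
    obs_x = [x for x, _ in obs]
    r_deleted = pearson_r(obs_x, [y for _, y in obs])

    # Option 2: replace each missing x with the mean of the observed x values.
    mean_x = sum(obs_x) / len(obs_x)
    filled_x = [mean_x if x is None else x for x in xs]
    r_imputed = pearson_r(filled_x, ys)

    return r_deleted, r_imputed

r_deleted, r_imputed = compare()
print(round(r_deleted, 2), round(r_imputed, 2))
```

In this setup the deleted-pairs estimate should land near the true value, while the mean-imputed estimate comes out lower, because the filled-in points add spread to one variable without adding any covariance.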

