# Statistics: Restricted range and linearity/non-linearity.

May 31, 2011 3:40 PM

StatisticsFilter: Range restriction and the strength of linear and non-linear relationships. Help?

Let's say I measure two things (e.g., height and cancer rates), and I expect the relationship between them to be cubic or quadratic.

If it is known that restricting the range of one or both variables will reduce the correlation coefficient (e.g., as this study shows), then does it follow that the strength of non-linear functions would also be reduced? If I have a restricted range when sampling height, will I be less likely to observe a quadratic function with cancer rate? Would that translate into a greater likelihood of observing a linear relationship? Why or why not?

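The range-restriction effect the question asks about can be sketched with simulated data (a toy linear relationship, not the linked study's data):

```python
import random

random.seed(0)

# Simulated data: a noisy linear relationship between two variables.
x = [random.gauss(0, 1) for _ in range(10_000)]
y = [xi + random.gauss(0, 1) for xi in x]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

r_full = pearson_r(x, y)

# Restrict the range of x to a narrow slice, keeping y as measured.
pairs = [(a, b) for a, b in zip(x, y) if -0.5 < a < 0.5]
r_restricted = pearson_r([a for a, _ in pairs], [b for _, b in pairs])

# The restricted-range correlation is markedly smaller than the full-range one,
# because the signal (variation in x) shrinks while the noise in y does not.
print(r_full, r_restricted)
```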

If I understand your question correctly, then "no".

Picture a normal distribution.

Looking at it as a linear function on the range ±∞, you have a correlation coefficient approaching 1 (with a little blip centered at the mean of the function). Reduce that range to ±3σ, and you have a very poor correlation indeed.

Now look at it as a quadratic (1-x²). Over ±∞, you have almost a zero correlation coefficient. On ±3σ, you end up with something that correlates fairly well... Not exactly perfect, but a good 0.8 or so.

With higher order polynomials, this effect grows more pronounced. You can model **any** arbitrary set of n points with an order n-1 polynomial, while that same set of points may look like nothing but pure noise to orders n-2 and below. And I suspect that is the point of this question: if you overfit your model to your data, you can **always** get a perfect fit.

posted by pla at 6:19 PM on May 31, 2011
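pla's overfitting point can be sketched: the unique degree n-1 polynomial through n points (built here via Lagrange interpolation, an assumed construction) reproduces the points exactly, even when they are pure noise:

```python
import random

random.seed(1)

# n "pure noise" points: y has no relationship to x at all.
n = 6
xs = list(range(n))
ys = [random.gauss(0, 1) for _ in xs]

def lagrange_eval(xs, ys, t):
    """Evaluate the unique degree n-1 interpolating polynomial at t."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (t - xj) / (xi - xj)
        total += term
    return total

# The degree n-1 polynomial hits every point exactly: a "perfect fit"
# to what is, by construction, noise.
for xi, yi in zip(xs, ys):
    assert abs(lagrange_eval(xs, ys, xi) - yi) < 1e-9
```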

If you're asking whether restricting the range will increase the *correlation*, which measures a linear relationship, then pla's answer is right. But based on your question, it looks like you're actually asking about the effects on the *fit* to a non-linear function, in which case I think my answer makes more sense.

posted by svenx at 6:55 PM on May 31, 2011
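The correlation-versus-fit distinction can be illustrated with made-up quadratic data: over a symmetric range, the linear correlation of y = x² is near zero even though y is an almost deterministic function of x, while restricting x to one arm of the parabola makes the same data look strongly linear:

```python
import random

random.seed(2)

# A purely quadratic relationship with a little noise.
x = [random.uniform(-1, 1) for _ in range(5_000)]
y = [xi ** 2 + random.gauss(0, 0.05) for xi in x]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

# Full symmetric range: the *linear* correlation is near zero.
r_full = pearson_r(x, y)

# One arm of the parabola: the same relationship now looks strongly linear.
pairs = [(a, b) for a, b in zip(x, y) if a > 0.3]
r_arm = pearson_r([a for a, _ in pairs], [b for _, b in pairs])

print(r_full, r_arm)
```

So range restriction can *raise* the linear correlation of a non-linear relationship, which is why the answer depends on whether "strength" means correlation or goodness of fit.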

I think there might be an ambiguity in the question. You wrote:

*If it is known that restricting the range of one or both variables will reduce the correlation coefficient*

If you meant "If it is known [...] for **a particular dataset**, and I'm going to fit a quadratic regression to that dataset", then I think the issue is complicated and depends on your specific dataset; also, I'm not sure why you would measure a correlation of a quadratic relationship, which I think is the reasoning behind pla's answer.

If instead you meant "It is known [...] **from past examples with other datasets**, and I'm going to fit a quadratic regression to an unrelated dataset", then I think svenx's answer is correct, *in general*. Note it might be the case that you have some data points that are outliers in both x and y, and reducing your range would drop these points from consideration and improve the fit of the quadratic relationship.

posted by JumpW at 9:28 PM on May 31, 2011
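The outlier scenario above can be sketched with invented data: a quadratic relationship plus a few hypothetical points that are outliers in both x and y, where restricting the range drops them and improves the quadratic fit (R² computed from a hand-rolled normal-equations solver):

```python
import random

random.seed(3)

# Well-behaved quadratic data, plus points that are outliers in both x and y.
x = [random.uniform(-1, 1) for _ in range(200)]
y = [2 * xi ** 2 - xi + random.gauss(0, 0.1) for xi in x]
x += [4.0, 4.5, 5.0]          # hypothetical joint outliers
y += [-30.0, 25.0, -35.0]     # nowhere near the true curve

def quad_fit_r2(xs, ys):
    """Least-squares quadratic fit via the 3x3 normal equations; returns R^2."""
    n = len(xs)
    # A^T A and A^T y for the design matrix with columns [1, x, x^2].
    s = [sum(xi ** k for xi in xs) for k in range(5)]
    ata = [[s[i + j] for j in range(3)] for i in range(3)]
    aty = [sum(yi * xi ** i for xi, yi in zip(xs, ys)) for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    m = [row[:] + [b] for row, b in zip(ata, aty)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    coef = [m[i][3] / m[i][i] for i in range(3)]
    pred = [coef[0] + coef[1] * xi + coef[2] * xi ** 2 for xi in xs]
    ybar = sum(ys) / n
    ss_res = sum((p - yi) ** 2 for p, yi in zip(pred, ys))
    ss_tot = sum((yi - ybar) ** 2 for yi in ys)
    return 1 - ss_res / ss_tot

r2_all = quad_fit_r2(x, y)

# Restricting the range drops the joint outliers and improves the fit.
kept = [(a, b) for a, b in zip(x, y) if -1 <= a <= 1]
r2_restricted = quad_fit_r2([a for a, _ in kept], [b for _, b in kept])

print(r2_all, r2_restricted)
```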

Restricting the sampling of the IV will substantially change non-linear fits in finite samples, because functions that differ outside the restriction can look identical inside it.

Restricting the DV is more complicated and depends on how you do it. Imagine a scatter plot of a linear function y = x + e. When you truncate y at ±1, then around, say, x = 1.2 the only points you get to see are ones where e was negative, so if you just plop a best-fit line on top of it, those points, which sit closer to y = 0 than the true average, will drag the slope down. The same bias will obviously happen with more complicated non-linear fits, but exactly how it plays out is hard to say. For example, the procedure might draw a fit that is roughly constant at y = 0.9 for x > 1 and still produce an excellent residual sum of squares.

Note that the answer "looked" different for x and y above, whereas they were symmetric in the paper you linked. The framework of (x, y) jointly (and not independently) distributed can behave differently from the conditional model (y|x).

Another note: the above was for truncation (which is what happened in your paper), but the answer is very different when you have censoring. For example, if instead of a point like (1, 1.2) disappearing it came back as (1, >1), you have hope of recovering the true relationship. That may be what you have in cancer data, where the rate may be estimated as zero or "small", but the data didn't disappear.
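A rough simulation of the DV-truncation bias described above, assuming y = x + e with standard normal e and truncation at ±1:

```python
import random

random.seed(4)

# True model: y = x + e, e ~ N(0, 1). Truncate the DV: keep only |y| <= 1.
data = []
for _ in range(50_000):
    xi = random.uniform(-2, 2)
    yi = xi + random.gauss(0, 1)
    if -1 <= yi <= 1:
        data.append((xi, yi))

xs = [a for a, _ in data]
ys = [b for _, b in data]

def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

# The fitted slope is well below the true slope of 1, because at large |x|
# only observations with favorably signed noise survive the truncation.
slope = ols_slope(xs, ys)
print(slope)
```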

posted by a robot made out of meat at 6:53 AM on June 1, 2011


I just wanted to say thanks to everyone for their input!

posted by tybeet at 10:36 AM on July 13, 2011


This thread is closed to new comments.
