March 18, 2012 7:36 AM Subscribe

Stats filter. I am doing multivariate regression for the first time and I want to understand what I'm doing, having gone beyond my formal training. I have many possible ways I could formulate the regression (different variables to include) and I want to find a model that fits as well as possible while using as few variables as possible. How?

I have what seems like a straightforward regression setup. A given object i has inputs x_1 to x_n with coefficients A_1 to A_n, and output y_i. (I think this is multivariate regression, rather than multiple regression, because there are multiple outputs. Correct me if I'm wrong.)

In the first formulation of this problem the output is almost a direct linear combination of some of the inputs (x), and I can use average coefficients which have some empirical basis. The problem I want to solve is that the inputs x_i have to be measured empirically, and some of these are much easier to collect than others -- also, some of them are related to others; they are not entirely independent, and in fact some of them may be almost fully correlated -- and I want to show a good way to approximate y_i using only a few of the inputs. In particular, I want to determine analytically the best such model out of several possible models.

The answer has something to do with correlation coefficients and residuals and r-squared. I've been playing around in Matlab and am getting somewhere, but I don't have a high-level procedure in mind - just fiddling around and not converging on an answer. Can you walk me through the procedure you would use to test different models, identify which variables matter and which ones can be thrown out, and demonstrate that you have arrived at a good answer?
posted by PercussivePaul to Science & Nature (12 answers total) 5 users marked this as a favorite

Regardless of whether the variables are themselves correlated (which they are in many cases), the two things you are asking for conflict with one another. For a large data set, the very best predictive model might have lots of variables, but not only that -- it may benefit from including interaction terms that multiply variables together, or transformed terms such as squares, cubes, ln, roots, etc.

The first thing I'd ask myself is whether the demands for simplicity in the model call for not having more complex terms (ie not just A_1*x_1 + A_2*x_2... but A_1*x_1 + A_2*x_2 + A_3*x_1*x_2 + A_4*x_1^2 + A_5*ln(x_1)...). In many real-world cases predictive models might benefit a great deal in their accuracy by adding non-linear terms for stronger variables, more so than they would by adding additional linear variables that have less impact on the model.
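To make the expanded-terms idea concrete, here is a minimal sketch in NumPy (the thread discusses R and Matlab, but the same idea carries over); all data and coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 50)   # hypothetical predictor (kept > 0 so ln is defined)
x2 = rng.uniform(1, 10, 50)

# Base design matrix: intercept plus linear terms only.
X_linear = np.column_stack([np.ones_like(x1), x1, x2])

# Expanded design matrix: add interaction, squared, and log terms.
X_expanded = np.column_stack([
    np.ones_like(x1),  # intercept
    x1, x2,            # linear terms (A_1*x_1 + A_2*x_2)
    x1 * x2,           # interaction (A_3*x_1*x_2)
    x1 ** 2,           # squared term (A_4*x_1^2)
    np.log(x1),        # log term (A_5*ln(x_1))
])

# Either matrix can be fit with ordinary least squares.
y = 2.0 * x1 + 0.5 * x1 ** 2 + rng.normal(0, 1, 50)  # toy response
coef, rss, rank, sv = np.linalg.lstsq(X_expanded, y, rcond=None)
```

The point is that "adding a squared term" is just adding one more column to the design matrix; the fitting machinery is unchanged.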

Next, ask yourself how big the model can be. Some of that might relate to the amount of data you have and the need to avoid overfitting. Some might relate to your need for simplicity and limited inputs. No one can answer this question but you.

Finally, it would be important to know whether all you care about is predicting Y, or if you plan to make inferences about the predictor x's (this speaks to what k8t is talking about).

If you can define roughly how big the model should be for your needs, and whether you would be willing to add more complex terms, the rest is simple -- and refers to the subject of "Model Selection" in regression. Software can then be given your constraints and then rank all models based on a variety of "goodness of fit" statistics (chi square, Akaike information criterion, Bayesian information criterion, deviance information criterion, etc.).
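As a sketch of what "rank models by goodness of fit" means in practice, here is a hand-rolled comparison of a few candidate models by AIC in Python (the answer points at stats software; this toy version, with made-up data, uses the Gaussian linear-model AIC up to an additive constant):

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares; returns the residual sum of squares."""
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def aic(rss, n, k):
    """AIC for a Gaussian linear model (up to an additive constant):
    n*ln(RSS/n) + 2k, where k counts estimated coefficients."""
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)                              # irrelevant predictor
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.5, n)

ones = np.ones(n)
candidates = {
    "x1":       np.column_stack([ones, x1]),
    "x1+x2":    np.column_stack([ones, x1, x2]),
    "x1+x2+x3": np.column_stack([ones, x1, x2, x3]),
}
scores = {name: aic(fit_ols(X, y), n, X.shape[1]) for name, X in candidates.items()}
best = min(scores, key=scores.get)  # lowest AIC wins
```

The 2k term is the complexity penalty: a larger model must reduce the residual sum of squares enough to pay for its extra coefficients.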

posted by drpynchon at 9:39 AM on March 18, 2012

It might help to be more specific, then. I think the problem is towards the simpler end of what you are imagining. The simple matter of Model Selection in regression is what I'm asking about. I've never had a proper stats course so I don't know how to do model selection.

I am trying to estimate carbon footprints of physical products. Each product has a full listing of parts, like kg of steel, kg of plastic, etc (hundreds of them). The output is calculated analytically using the full listings of parts - I've done this already. The parts can be grouped together so that each product is decomposed into 5 or 6 systems, by mass. These are my input variables x - how much of each system is present in each product i. I want to produce a model that estimates the output given a description of the product in terms of these 5 or 6 systems. So far this is straightforward because I have enough information from my calculation of the output to make an analytical approximation of what the coefficients for these systems should be.

I have two lines of inquiry here. First I have other parameters I can introduce like mass, dimensions, purchase price, etc which were not part of the original calculation but may have some predictive power, and I want to try adding some of these to the model. Second, I want to see how simple I can make my model and still get a reasonably good result. Let's say I choose some maximum residual I am willing to accept, and want to produce the simplest possible combination of variables that meets that target. I've already noticed I get an okay result just using mass alone and throwing out all of the other detailed information - a linear model with mass gives an R^2 of around 0.6. (Essentially, because heavier products have more parts in them, and more parts usually means a higher footprint -- mass is acting as a rough proxy.) My field tends to require first-order judgments as data is often scarce, and it would often be difficult to collect data on all of the x's, so a very simple model is valuable. Thus I want to explore the tradeoff between complexity and accuracy more systematically.
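The mass-alone baseline described above amounts to a one-variable linear fit and its R^2. A minimal Python sketch with hypothetical numbers (the real data would be the product masses and calculated footprints):

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple linear fit y ~ a + b*x."""
    b, a = np.polyfit(x, y, 1)            # slope, intercept
    resid = y - (a + b * x)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: footprint loosely tracks mass, as in the question.
rng = np.random.default_rng(2)
mass = rng.uniform(0.5, 20, 80)                    # kg
footprint = 3.0 * mass + rng.normal(0, 15, 80)     # kg CO2e, noisy
r2 = r_squared(mass, footprint)
```

R^2 here is the fraction of the footprint's variance the mass-only line explains, which is exactly the "rough proxy" quality being described.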

As I said I'm a bit out of my depth. I didn't know this was called 'model selection' (makes sense though). I think what I want to do is called 'stepwise selection' or 'stepwise regression'. Does that sound right? It doesn't look like Matlab has built-in functions for this, but I think R does and I have been meaning to try it (I don't have any other stats software).

posted by PercussivePaul at 10:21 AM on March 18, 2012

R is by far the way to go. One method would indeed be stepwise selection (I would favor backward selection in most cases if you do this), but if you have a half-decent computer for your purposes it would potentially be more instructive to have the software run ALL the possible models and rank them by some goodness-of-fit statistic (I'd go with AIC). This is sometimes referred to as something like "All Subsets" or "Best Subsets" selection. This PDF and this PDF are nice brief tutorials on the subject using R. Do seriously consider including non-linear terms (especially squared terms and root or ln terms) in the model space though. You'd be surprised how much better a fit you might get with even just two terms if you allow for a squared term and such.
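The "all subsets" idea is just exhaustive enumeration. A rough Python sketch (the answer recommends R packages for this; here everything, including the data, is hypothetical, with OLS and Gaussian AIC done by hand):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 120
predictors = {name: rng.normal(size=n) for name in ["x1", "x2", "x3", "x4"]}
y = 1.5 * predictors["x1"] - 2.0 * predictors["x3"] + rng.normal(0, 0.5, n)

def aic_of(subset):
    """Fit OLS on the named predictors; return AIC (Gaussian, up to a constant)."""
    X = np.column_stack([np.ones(n)] + [predictors[p] for p in subset])
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return n * np.log(rss / n) + 2 * X.shape[1]

# Enumerate every non-empty subset of predictors ("all subsets" selection).
results = []
for r in range(1, len(predictors) + 1):
    for subset in itertools.combinations(predictors, r):
        results.append((aic_of(subset), subset))

results.sort()                       # best (lowest AIC) first
best_aic, best_subset = results[0]
```

With p candidate variables this fits 2^p - 1 models, which is why it is only feasible for modest p -- fine for the 5 or 6 systems described here.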

Also, another thing that comes up that you might not have considered is taking this a step further and trying to actually cross-validate the model, which is really how you should be estimating the performance of a predictive model in terms of its potential applicability to *another* similar dataset. If you have a decent-sized data set, this would be well-advised.
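A bare-bones version of k-fold cross-validation, sketched in Python with invented data (standard stats packages provide this ready-made; this just shows the mechanics):

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Average held-out mean squared error over k folds (manual k-fold CV)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, _, _, _ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[test] @ coef
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(0, 0.5, n)
X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, rng.normal(size=(n, 8))])  # 8 junk columns

# An overfit model can look better in-sample but worse out-of-sample.
mse_small = kfold_mse(X_small, y)
mse_big = kfold_mse(X_big, y)
```

Each model is scored only on data it never saw during fitting, which is what makes CV an honest estimate of performance on another similar dataset.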

posted by drpynchon at 10:48 AM on March 18, 2012

It says you're a grad student in your profile, so I'd just strongly advise you to check out your campus stats help. You're asking a pretty big question that doesn't have, like, one right answer. There is more than one way to choose a model. Lots of people need to do regressions who haven't had a stats class and they do this by working with a trained statistician who will help them come up with a model (or two or three). And it'll be a lot easier to have this conversation in person.

posted by mandymanwasregistered at 10:48 AM on March 18, 2012 [1 favorite]

It sounds like you might want to explore principal component analysis. Basically, the idea is to replace your n highly correlated independent variables with one or a few principal-component vectors. The PCA diagonalization identifies certain linear combinations of the variables that are de-correlated. R has good built-in support for this. There's of course lots to think about, in that there is no unambiguous advice that applies universally. And if you're estimating correlations from incomplete data vectors (say you have one measurement of x1 and x2, another of x2, x3, and x4, and so on), you will want to be careful in how you go about doing things.
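For a sense of what PCA does with correlated inputs, here is a small Python sketch using the SVD of the centered data matrix (synthetic data; R's `prcomp` does the equivalent):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
# Two nearly collinear predictors plus one independent one,
# mimicking the correlated inputs described in the question.
base = rng.normal(size=n)
X = np.column_stack([
    base + rng.normal(0, 0.1, n),
    base + rng.normal(0, 0.1, n),
    rng.normal(size=n),
])

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)   # variance share of each component
scores = Xc @ Vt.T                    # data projected onto the components
```

The first component absorbs the variation the two collinear columns share, so the regression can use one de-correlated score in place of two redundant inputs.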

As for determining which model is best and how many variables to include... How deep do you want to go? You might try to start by looking into Bayesian model comparison. But if you have a lot of data, the simplest route may just be simulation.

posted by dsword at 7:23 AM on March 19, 2012

Unfortunately, variable selection is a complex topic. I highly, highly recommend that you get a consult if this is grad work related. For no other reason, when you write it up you will need to have literature to back up what you did besides AskMe. Other questions: when you say simple model, does it have to be simple in structure or simple in number of input variables? Is there an interpretation you're looking for, or do you just want to make predictions? Does the data you have match up well in terms of distribution of predictors to the data it'll be applied on? How much data are we talking about?

posted by a robot made out of meat at 8:34 AM on March 19, 2012

Let me see if I understand.

For some set of observations, you have their true carbon footprint and a long list of variables that go into making that footprint. kg of steel, etc.

But you want to work with other / more data where you're not going to have that long list of variables, so you need to estimate their carbon footprint from the data that you do have for them. (if this isn't true, and you have the full data and true carbon footprint for all observations, then for God's sake just use that) You will then be doing something else with these estimated carbon footprints. Is that right?

This is the sort of thing that's going to vary strongly by discipline, so I would suggest talking to methodologically-savvy people in your discipline. Even above talking to an actual statistician -- there's little use having a discussion that's about fitting data when it should be about causal inference, or in receiving advice that (a) is going to be immediately discounted by people above you in the great chain of being and (b) you're not in any position to really argue for other than "This guy said to do it this way."

My own perspective comes from my own discipline. For people like me,

(1) The best model is not the one with the best fit. The best model is the one that most correctly captures your theory, always. This is part of why, to people like me, stepwise regression is the devil, just atheoretic sloppiness -- I can't be bothered to think about what causes Y, so, fuck it, I'll just let the stupid computer tell me what its correlates are and say those must be the causal factors.

(2) The primary implication of this is that you should have a simple theory of where carbon footprint comes from and implement that. You've already done some of that by reasoning about mass, so do more of it. I would say that doing this is better than just fitting the shit out of your data, because you're likely to fit with some irrelevant variables whose noise meshes with the noise in your data. Or, more broadly, I would worry a lot that your model with a 0.995 R2 doesn't generalize well to the larger sample you care about unless it has a solid theoretical foundation. Lots of things in the universe are correlated with each other without any real relationship. Likewise I'd suggest choosing a functional form on the basis of at least a little theoretic thought rather than just letting the data fit itself.

(3) You're going to worry that your estimated carbon footprints have error, which they will. I'd suggest a research design that looks at your expanded, estimated-carbon-footprint data *and* your original, true-carbon-footprint data with a lower N. Are your results consistent between the two, at least for your key inferences? Good. Another way to think about it: "Let me show you results for a restricted sample with true carbon footprints... But you're worried about my small N. I can't replicate it fully, but here are some results with a larger dataset but only estimated carbon footprints that say the same thing. Here are some reasons to think that my estimates are a good proxy for true carbon footprints."

posted by ROU_Xenophobe at 9:08 AM on March 19, 2012

Exactly. Thank you ROU, that's very helpful. Part three is more or less what I've been imagining doing, but you've sketched it out nicely.

I'm not going to accept models that I can't provide a theoretical justification for. The problem was more that I lack the tools to compare different possible models. Like, one with mass only, and another with mass plus x1 and x2 -- how do I measure how good they are? (drpynchon's first link was very helpful and I think I understand this now). I think then I should step away from shotgun approaches and be more methodical about testing a smaller number of models that make sense analytically and intuitively.

I have raised the issue of a consult with my advisor and am getting some mild push-back ('I'm not sure you need it'). I would be happy to go this route, even though I had hoped to do it on my own. Thanks for the help.

posted by PercussivePaul at 11:56 AM on March 19, 2012

It's another thing that can vary by discipline. I don't know how much you need to worry that your estimator of carbon footprints is not the best possible simple estimator of carbon footprints.

I mean, say you have the third best. And that if God gave you the best and second best, your results would still be more or less the same as you got with your third-best. In that case, who gives a damn?

Having said that, another way to approach this is as a robustness question instead of model selection, especially if generating new estimates isn't computationally expensive. If you're worried about whether to just use mass or to include x1 and x2... fuck it. Do it with estimates from just mass. Do it again with estimates from mass and X1. Mass and x2. Mass and x1 and x2. Do they keep telling you the same thing? Then report one of them -- probably the theoretically simplest or otherwise "cleanest" -- and mention in a footnote that you created these estimates every sensible way you could think of and it didn't matter.
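That robustness loop is easy to mechanize. A Python sketch with invented data (variable names chosen to mirror the thread; nothing here is the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 150
mass = rng.uniform(1, 20, n)
x1 = 0.3 * mass + rng.normal(0, 1, n)   # extra predictor, loosely tied to mass
x2 = rng.normal(size=n)                 # extra predictor, unrelated
y = 5.0 * mass + rng.normal(0, 5, n)    # toy response

def predictions(columns):
    """OLS fitted values for a given choice of predictors (plus intercept)."""
    X = np.column_stack([np.ones(n)] + columns)
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

# Fit every sensible specification and compare the fitted values.
specs = {
    "mass":       predictions([mass]),
    "mass+x1":    predictions([mass, x1]),
    "mass+x2":    predictions([mass, x2]),
    "mass+x1+x2": predictions([mass, x1, x2]),
}

# If the fitted values barely move across specifications,
# report the simplest one and footnote the rest.
base = specs["mass"]
max_shift = max(np.max(np.abs(p - base)) for p in specs.values())
```

Comparing the predictions directly (rather than the fit statistics) is the point: if every specification tells the same story, the choice among them is moot.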

posted by ROU_Xenophobe at 1:31 PM on March 19, 2012

(but again in other disciplines this might be anathema)

posted by ROU_Xenophobe at 1:31 PM on March 19, 2012

Actually, ROU, I was just pondering on the bus ride home that I should try to account for uncertainty in my outputs. Which means maybe running Monte Carlo sims and generating a big dataset with a range of estimates BEFORE I run the regressions. In that case I have a feeling the differences between many of the models would just be a wash, so pick the simplest. Robustness is a good way to approach it. (Maybe you should be on my committee.)
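One way that Monte Carlo step might look, sketched in Python with made-up numbers (the ~10% output uncertainty is purely illustrative): perturb the calculated outputs, refit each time, and see how much the coefficient of interest moves.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
mass = rng.uniform(1, 20, n)
footprint = 4.0 * mass + rng.normal(0, 8, n)   # hypothetical calculated outputs

# Suppose each calculated output carries roughly 10% uncertainty.
# Resample the outputs many times and refit the mass-only model.
slopes = []
for _ in range(1000):
    noisy = footprint * rng.normal(1.0, 0.10, n)   # perturb outputs by ~10%
    b, a = np.polyfit(mass, noisy, 1)              # slope, intercept
    slopes.append(b)

slopes = np.array(slopes)
spread = slopes.std()
# If this spread dwarfs the differences between candidate models,
# the model choice is indeed a wash; pick the simplest.
```

The same loop extends to the multi-variable models: rerun the whole selection exercise inside the loop and see whether the "best" model is stable under the output uncertainty.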

posted by PercussivePaul at 1:42 PM on March 19, 2012

What variables matter? Well, that depends on your theoretical perspective. If previous theory says that age matters, put age in. Also remember that some of your variables may covary (like education and income).

Can you possibly describe the variables that you have and the nature of them?

posted by k8t at 9:03 AM on March 18, 2012

This thread is closed to new comments.