# GLM Variable Weighting

March 15, 2007 9:24 PM

Statistics. Generalized Linear Models: what is being done with the variable weights in this R script I've been handed down?

I've been handed an ancient and archaic R script that I'm using to do some variable selection, using generalized linear models. I'm trying to work out what it's doing, and if there's some *name* for the method of variable weighting it's using.

Essentially, it runs a GLM on a null model, a model saturated with all variables, a series of models that consist of the null model **with** each variable (i.e. each variable on its own), and a series of models that consist of the saturated model **without** each variable.

It calculates the % deviance explained for each of these models.

It then calculates the "change of deviance" for each model. In the case of the "null + variable" models, this is the additional deviance explained by the model *with* the variable, as compared to the null model. In the case of the "saturated - variable" models, it's the decrease in deviance explained of the model without the variable, compared to the saturated model.

In other words, it's basically working out how much of a difference each variable makes to the model; adding a variable to a null model might produce a great increase in deviance explained, but subtracting that variable from a saturated model might decrease the deviance explained only a little, because other variables still in the saturated model might still contain the information of the missing variable. This variable, therefore, has less explanatory power than a different variable that contains unique information.
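
The steps described so far can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the script itself: the data (`mtcars`), the response, the predictor names, the Gaussian family and the helper names `pc_dev` and `fit_glm` are all made up for the example, and "saturated" is read here as "the model containing every candidate variable".

```r
# Hypothetical stand-in data: mtcars ships with R. The response, the
# predictors and the Gaussian family are assumptions for this sketch.
dat  <- mtcars
resp <- "mpg"
vars <- c("wt", "hp", "disp", "drat")

# Percent deviance explained, relative to the intercept-only (null) fit.
pc_dev <- function(fit) {
  100 * (fit$null.deviance - fit$deviance) / fit$null.deviance
}

# Fit a GLM of the response on the given predictor names.
fit_glm <- function(terms) {
  glm(reformulate(terms, response = resp), data = dat, family = gaussian())
}

# Model with every candidate variable.
full_pc <- pc_dev(fit_glm(vars))

# Null model + one variable. The null model explains 0%, so the percent
# deviance explained is also the change in deviance explained.
plus <- sapply(vars, function(v) pc_dev(fit_glm(v)))

# Full model - one variable: percent deviance explained by the reduced model.
minus <- sapply(vars, function(v) pc_dev(fit_glm(setdiff(vars, v))))

out <- rbind(
  data.frame(row.names = paste0("+", vars), pcdev = plus,  ch.dev = plus),
  data.frame(row.names = paste0("-", vars), pcdev = minus, ch.dev = full_pc - minus)
)

# Largest change in deviance explained first.
print(round(out[order(-out$ch.dev), ], 4))
```

A variable with a large "+" change but a small "-" change is one whose information is mostly duplicated by the other variables, which is the situation described above.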

After it's calculated all these "changes in deviance", it somehow converts all this into a weight for each variable (this is where the R script loses me, and I can't figure out what it's doing). In any case, it gives output like this. The "+" and "-" signs indicate whether it's reporting the null model **+** the variable, or the saturated model **-** the variable:

                   pcdev   ch.dev
    +DWMeanMin   21.0156  21.0156
    +DWMeanTemp   8.7369   8.7369
    +DW3pmTemp    3.5637   3.5637
    -DWMeanMin   32.5202   2.3141
    -DWMeanTemp  32.9797   1.8546
    -DWMeanMax   33.0875   1.7468
    -DW3pmTemp   33.8545   0.9798
    +DWMeanMax    0.4372   0.4372

    Weights:
     DWMeanMin  DWMeanTemp   DW3pmTemp   DWMeanMax
        0.5739      0.2606      0.1118      0.0537

Which is saying that variable "DWMeanMin" is really useful, containing a lot of information the other variables don't. "DWMeanTemp" might also be useful. "DW3pmTemp" is a long shot, and "DWMeanMax" doesn't contribute much explanatory power at all.
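
One observation that can be checked from the posted numbers alone: the weights are consistent with adding each variable's two `ch.dev` values (the gain when it is added to the null model plus the loss when it is dropped from the saturated model) and dividing by the total over all variables. This is inferred from the output, not read from the script, so treat it as a guess that happens to reproduce the printed weights. The vector names `gain_added` and `loss_dropped` are made up for the example.

```r
# ch.dev values copied from the output above.
gain_added   <- c(DWMeanMin = 21.0156, DWMeanTemp = 8.7369,
                  DW3pmTemp = 3.5637,  DWMeanMax  = 0.4372)  # "+" rows
loss_dropped <- c(DWMeanMin = 2.3141,  DWMeanTemp = 1.8546,
                  DW3pmTemp = 0.9798,  DWMeanMax  = 1.7468)  # "-" rows

# Combined change in deviance explained per variable, scaled to sum to 1.
total   <- gain_added + loss_dropped
weights <- round(total / sum(total), 4)
print(weights)  # 0.5739 0.2606 0.1118 0.0537, matching the printed weights
```

The same output also implies that the saturated model explains about 34.83% of the deviance, since `pcdev + ch.dev` gives 34.8343 on every "-" row.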

I understand the output. And it's useful. I can post excerpts of the R script if required. But what's this **method** of variable selection called?

Hi Jimbob, I don't have much of a statistics background but I took a Machine Learning class last year and what this program is doing sounds a lot like a primitive form of Linear Discriminant Analysis.

I'm going to stick my head out and suggest that you implement a linear discriminant analysis algorithm (or download one of the many available on the Internet) and see how it performs for your data.

posted by onalark at 10:47 PM on March 15, 2007

If I'm reading you right, this sounds like it's looking for the relative importance of variables in the model by first and last order of entry. There is a formal R package, relaimpo, that will do this for you, and it includes some alternative and newer ideas/algorithms for how to think about this. You might want to check it out and compare the results you get with what your script does.

posted by shelbaroo at 8:02 AM on March 17, 2007
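
For comparison, a minimal sketch of the relaimpo suggestion, assuming the package is installed. `calc.relimp()` is its main function, and its "first" and "last" metrics correspond to a variable entering the model alone and entering it last. It is aimed at linear models fitted with `lm()`, so whether it accepts a particular GLM family is something to check in its documentation. The data and predictors below are stand-ins, not the variables from the question.

```r
# Stand-in linear model on a dataset that ships with R (an assumption
# for this sketch; swap in your own response and predictors).
fit <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)

if (require("relaimpo", quietly = TRUE)) {
  # rela = TRUE rescales each metric so the importances sum to 1.
  ri <- calc.relimp(fit, type = c("first", "last"), rela = TRUE)
  print(ri@first)  # importance when each variable is the only predictor
  print(ri@last)   # importance when each variable is added last
} else {
  message("relaimpo is not installed; try install.packages(\"relaimpo\")")
}
```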

This thread is closed to new comments.

posted by Jimbob at 10:04 PM on March 15, 2007