GLM Variable Weighting
March 15, 2007 9:24 PM

Statistics, generalized linear models: what is this R script I've been handed doing with its variable weights?

I've been handed an ancient and archaic R script that I'm using to do some variable selection, using generalized linear models. I'm trying to work out what it's doing, and if there's some name for the method of variable weighting it's using.

Essentially, it runs a GLM on a null model, a model saturated with all variables (strictly speaking a full model, since a truly saturated GLM has one parameter per observation), a series of models that consist of the null model plus each variable (i.e. each variable on its own), and a series of models that consist of the saturated model without each variable.

It calculates the % deviance explained for each of these models.

It then calculates the "change in deviance" for each model. In the case of the "null + variable" models, this is the additional deviance explained by the model with the variable, as compared to the null model. In the case of the "saturated - variable" models, it's the drop in deviance explained when the variable is removed from the saturated model.

In other words, it's basically working out how much of a difference each variable makes to the model; adding a variable to a null model might produce a great increase in deviance explained, but subtracting that variable from a saturated model might decrease the deviance explained only a little, because other variables still in the saturated model might still contain the information of the missing variable. This variable, therefore, has less explanatory power than a different variable that contains unique information.
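A minimal sketch of that first-in/last-out comparison, in Python rather than R and not taken from the script itself. The % deviance explained figures are copied from the output quoted in this post. The full-model figure of 34.8343 is an inference (each "-" row's two numbers add up to it), the null model is assumed to explain 0%, and `first_last_changes` is a hypothetical helper name.

```python
def first_last_changes(variables, explained):
    """Return {variable: (first, last)} changes in % deviance explained.

    `explained` maps a frozenset of variable names to the % deviance
    explained by the model containing exactly those variables.
    """
    all_vars = frozenset(variables)
    null = explained[frozenset()]
    full = explained[all_vars]
    changes = {}
    for v in variables:
        # Gain from adding v to the null model ("+" rows).
        first = explained[frozenset([v])] - null
        # Loss from dropping v from the full model ("-" rows).
        last = full - explained[all_vars - {v}]
        changes[v] = (first, last)
    return changes


variables = ["DWMeanMin", "DWMeanTemp", "DW3pmTemp", "DWMeanMax"]
all_vars = frozenset(variables)

explained = {
    frozenset(): 0.0,        # assumption: null model explains 0%
    all_vars: 34.8343,       # assumption: inferred from the "-" rows
    frozenset(["DWMeanMin"]): 21.0156,
    frozenset(["DWMeanTemp"]): 8.7369,
    frozenset(["DW3pmTemp"]): 3.5637,
    frozenset(["DWMeanMax"]): 0.4372,
    all_vars - {"DWMeanMin"}: 32.5202,
    all_vars - {"DWMeanTemp"}: 32.9797,
    all_vars - {"DW3pmTemp"}: 33.8545,
    all_vars - {"DWMeanMax"}: 33.0875,
}

changes = first_last_changes(variables, explained)
for v, (first, last) in changes.items():
    print(f"{v:12s} first={first:8.4f}  last={last:8.4f}")
```

DWMeanMax shows the opposite pattern to the others: it adds almost nothing on its own (0.4372) but removing it from the full model costs more (1.7468).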

After it's calculated all these changes in deviance, it somehow converts all this into a weight for each variable (this is where the R script loses me, and I can't figure out what it's doing). In any case, it gives output like this. The "+" and "-" signs indicate whether it's reporting the null model + the variable, or the saturated model - the variable:
+DWMeanMin     21.0156    21.0156
+DWMeanTemp     8.7369     8.7369
+DW3pmTemp      3.5637     3.5637
-DWMeanMin     32.5202     2.3141
-DWMeanTemp    32.9797     1.8546
-DWMeanMax     33.0875     1.7468
-DW3pmTemp     33.8545     0.9798
+DWMeanMax      0.4372     0.4372

DWMeanMin  DWMeanTemp  DW3pmTemp  DWMeanMax
   0.5739      0.2606     0.1118     0.0537
This says that variable "DWMeanMin" is really useful, containing a lot of information the other variables don't. "DWMeanTemp" might also be useful. "DW3pmTemp" is a long shot, and "DWMeanMax" doesn't contribute much explanatory power at all.

I understand the output. And it's useful. I can post excerpts of the R script if required. But what's this method of variable selection called?
posted by Jimbob to Science & Nature (3 answers total)
Response by poster: I've just figured out how the weights are calculated. Simple, really. For each variable, it sums the two change-in-deviance values, then normalizes those sums so that they total 1.
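As a check on that reading, here is a small sketch in Python (not the R script itself) that redoes the arithmetic using the change-in-deviance values from the output above. The dictionary names `first`, `last`, `totals` and `weights` are just illustrative.

```python
# Change in deviance when each variable is added to the null model ("+" rows).
first = {
    "DWMeanMin": 21.0156,
    "DWMeanTemp": 8.7369,
    "DW3pmTemp": 3.5637,
    "DWMeanMax": 0.4372,
}

# Change in deviance when each variable is dropped from the
# saturated model ("-" rows, second column).
last = {
    "DWMeanMin": 2.3141,
    "DWMeanTemp": 1.8546,
    "DW3pmTemp": 0.9798,
    "DWMeanMax": 1.7468,
}

# Sum the two changes per variable, then normalize so the weights total 1.
totals = {v: first[v] + last[v] for v in first}
grand_total = sum(totals.values())
weights = {v: totals[v] / grand_total for v in totals}

for v, w in weights.items():
    print(f"{v:12s} {w:.4f}")
# DWMeanMin    0.5739
# DWMeanTemp   0.2606
# DW3pmTemp    0.1118
# DWMeanMax    0.0537
```

The printed values match the weights line of the script's output to four decimal places.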
posted by Jimbob at 10:04 PM on March 15, 2007

Hi Jimbob, I don't have much of a statistics background but I took a Machine Learning class last year and what this program is doing sounds a lot like a primitive form of Linear Discriminant Analysis.

I'm going to stick my neck out and suggest that you implement a linear discriminant analysis algorithm (or download one of the many available on the Internet) and see how it performs on your data.
posted by onalark at 10:47 PM on March 15, 2007

If I'm reading you right, this sounds like it's assessing the relative importance of the variables in the model by first and last order of entry. There is an R package, relaimpo, that will do this for you, and it includes some alternative and newer ideas/algorithms for how to think about this. You might want to check it out and compare the results you get with what your script does.
posted by shelbaroo at 8:02 AM on March 17, 2007
