Statistics. Generalized Linear Models; What's being done in this R script I've been handed down, with the variable weights?
I've been handed an ancient and archaic R script that I'm using to do some variable selection with generalized linear models. I'm trying to work out what it's doing, and whether there's a name for the method of variable weighting it uses.
Essentially, it runs a GLM on a null model, a model saturated with all variables, a series of models that consist of the null model with each variable (i.e. each variable on its own), and a series of models that consist of the saturated model without each variable.
It calculates the % deviance explained for each of these models.
It then calculates the "change of deviance" for each model. For the "null + variable" models, this is the additional deviance explained by the model with the variable, compared to the null model. For the "saturated - variable" models, it's the decrease in deviance explained by the model without the variable, compared to the saturated model.
In other words, it's basically working out how much of a difference each variable makes to the model. Adding a variable to a null model might produce a great increase in deviance explained, but subtracting that variable from a saturated model might decrease the deviance explained only a little, because other variables still in the saturated model might carry the same information as the missing variable. Such a variable has less explanatory power than a different variable that contains unique information.
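To make the procedure concrete, here is a minimal sketch of the add-one / drop-one comparison as I understand it. It is not the script I was handed: the data are simulated, the variable names (`x1`, `x2`, `x3`), the Poisson family and the helper `pct_dev` are all made up for illustration, and "full" here means the model with all variables (what I've been calling saturated).

```r
# Hypothetical helper: percent deviance explained by a fitted GLM.
pct_dev <- function(fit) 100 * (1 - fit$deviance / fit$null.deviance)

# Simulated stand-in data (assumption: not the real data set).
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$x2 <- d$x1 + 0.3 * d$x2   # make x2 largely redundant with x1
d$y  <- rpois(n, exp(0.2 + 0.5 * d$x1 + 0.3 * d$x3))

vars <- c("x1", "x2", "x3")
fit_pc <- function(v) {
  pct_dev(glm(reformulate(v, response = "y"), family = poisson, data = d))
}

# Model with all variables.
full_pc <- fit_pc(vars)

# "+var": null model plus one variable. The null model explains 0%,
# so the change equals the % deviance explained.
plus_pc <- unname(sapply(vars, fit_pc))

# "-var": full model minus one variable. The change is the drop from the full model.
minus_pc <- unname(sapply(vars, function(v) fit_pc(setdiff(vars, v))))

out <- data.frame(
  pcdev  = c(plus_pc, minus_pc),
  ch.dev = c(plus_pc, full_pc - minus_pc),
  row.names = c(paste0("+", vars), paste0("-", vars))
)
print(round(out[order(-out$ch.dev), ], 4))
```

In this toy setup `x2` is built mostly from `x1`, which is the situation described above: a variable can look useful on its own while dropping it from the full model changes little.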
After it has calculated all these changes in deviance, it somehow converts them into a weight for each variable (this is where the R script loses me, and I can't figure out what it's doing). In any case, it gives output like this. The "+" and "-" signs indicate whether it's reporting the null model + the variable, or the saturated model - the variable:
              pcdev    ch.dev
+DWMeanMin    21.0156  21.0156
+DWMeanTemp    8.7369   8.7369
+DW3pmTemp     3.5637   3.5637
-DWMeanMin    32.5202   2.3141
-DWMeanTemp   32.9797   1.8546
-DWMeanMax    33.0875   1.7468
-DW3pmTemp    33.8545   0.9798
+DWMeanMax     0.4372   0.4372
Weights:
DWMeanMin  DWMeanTemp  DW3pmTemp  DWMeanMax
   0.5739      0.2606     0.1118     0.0537
This says that variable "DWMeanMin" is really useful, containing a lot of information the other variables don't. "DWMeanTemp" might also be useful. "DW3pmTemp" is a long shot, and "DWMeanMax" doesn't contribute much explanatory power at all.
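For what it's worth, one simple rule does appear to reproduce the weights from the ch.dev column: add each variable's "+" change to its "-" change, then divide by the total so the weights sum to 1. This is a guess fitted to the printed numbers, not something read out of the script, and the vector names `add` and `drop` are my own.

```r
# Changes in % deviance explained, copied from the ch.dev column above.
add  <- c(DWMeanMin = 21.0156, DWMeanTemp = 8.7369, DW3pmTemp = 3.5637, DWMeanMax = 0.4372)
drop <- c(DWMeanMin = 2.3141,  DWMeanTemp = 1.8546, DW3pmTemp = 0.9798, DWMeanMax = 1.7468)

# Guessed rule (assumption): sum each variable's two changes,
# then rescale so the weights sum to 1.
w <- (add + drop) / sum(add + drop)
print(round(w, 4))
```

Rounded to four places this gives 0.5739, 0.2606, 0.1118 and 0.0537, matching the reported weights. Whether the script actually computes them this way would need checking against its code.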
I understand the output, and it's useful. I can post excerpts of the R script if required. But what is this method of variable selection called?
posted by Jimbob at 10:04 PM on March 15, 2007