Principal components factor analysis and scale construction (filter) again
August 18, 2008 5:43 PM
When constructing a scale based on the results of principal components factor analysis, and the loadings have differing signs yet there is a theoretical reason for the variables to be coded as they are, what to do?
Example. I have four variables. For three of the variables, higher scores indicate more "desirable" values for societies on measures of development. The fourth variable is a macro-level variable that indicates the percentage of a society made up of people with a certain characteristic. The factor loadings look like this:
Var1: .7588
Var2: -.8355
Var3: .7138
Var4: -.7984
When constructing the scale, do I need to recode the variables with negative signs so that all items have positive loadings?
The component is constructed as lambda1*x1 + lambda2*x2, etc. The loadings (the lambdas) can be positive or negative, which allows oppositely coded variables to coexist.
posted by a robot made out of meat at 5:54 PM on August 18, 2008
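A minimal sketch of that linear combination, using the loadings from the question on made-up standardized data (real data and software would of course replace this):

import numpy as np

# Made-up data: rows are societies, columns are Var1..Var4.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Standardize each column; PCA on a correlation matrix works on z-scores.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Loadings from the question, signs kept exactly as estimated.
lam = np.array([0.7588, -0.8355, 0.7138, -0.7984])

# The component score is the signed sum lambda1*x1 + lambda2*x2 + ...;
# negatively loading variables pull the score down rather than breaking anything.
scores = Z @ lam
print(scores[:5])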
I'm not sure exactly what the problem is, but if you find principal components A and B from a higher-dimensional dataset, you can use either +A or -A for your new +X axis in the new lower-dimensional space and either +B or -B for the +Y axis. It doesn't affect the correctness of the PCA; it just transforms it into a mirrored space.
It makes sense to have the space correlate with your intuition about the data.
posted by demiurge at 5:57 PM on August 18, 2008
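A generic sketch of the mirror-image point (arbitrary made-up data, not the poster's): negating a component's loading vector only negates the scores, and the variance it captures is unchanged.

import numpy as np

# Arbitrary data, 100 cases by 4 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# First principal component of the correlation matrix.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

# +pc1 and -pc1 are equally valid axes: the scores flip sign,
# and the variance captured is identical.
print(np.allclose(Z @ pc1, -(Z @ (-pc1))))              # True
print(np.isclose((Z @ pc1).var(), (Z @ (-pc1)).var()))  # True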
I'd check the intercorrelations of your variables. If some variables that you think should belong to the same scale are highly negatively correlated, that would cause problems for the scale you're trying to construct (and it would tell you something interesting about your data, at any rate).
posted by nixxon at 6:17 PM on August 18, 2008
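A sketch of that intercorrelation check, assuming the four items sit in a pandas DataFrame (the data below are synthetic placeholders):

import numpy as np
import pandas as pd

# Synthetic stand-in; the real frame would have one row per society
# and the four candidate scale items as columns.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(50, 4)),
                  columns=["Var1", "Var2", "Var3", "Var4"])

# Pairwise correlations among the candidate items; large negative entries
# among items expected to move together deserve a closer look.
print(df.corr().round(3))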
Response by poster: Nixxon is on to what I might be thinking about here. Two of the variables are correlated at -.5427, though I suspect this is being driven by an outlier (Saudi Arabia -- high GDP per capita, but low on many other measures). I dropped Saudi Arabia out, but the correlation is still -.5057. So it's an interesting finding, but totally counterintuitive to the theoretical justifications.
posted by proj at 6:21 PM on August 18, 2008
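One way to check whether a single case like Saudi Arabia is carrying a correlation is to recompute it dropping one observation at a time. A sketch with made-up data and hypothetical column names:

import numpy as np
import pandas as pd

# Synthetic example; in the real data the columns might be GDP per capita
# and the other measure, with one row per country.
rng = np.random.default_rng(3)
df = pd.DataFrame({"gdp_pc": rng.normal(size=40),
                   "other_measure": rng.normal(size=40)},
                  index=[f"country_{i}" for i in range(40)])

full_r = df["gdp_pc"].corr(df["other_measure"])
print(f"full-sample r = {full_r:.4f}")

# Leave-one-out correlations: if dropping one country moves r a lot,
# that country is doing much of the work.
for country in df.index:
    rest = df.drop(country)
    r = rest["gdp_pc"].corr(rest["other_measure"])
    if abs(r - full_r) > 0.1:
        print(f"dropping {country}: r = {r:.4f}")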
No, no, no. PCA doesn't care about sign. The first principal component captures as much information from your data matrix as it's possible to capture in one dimension. It doesn't care which direction the variables are pointed when they start.
Let's say you have a survey with Likert scale items asking people to rate agreement from Strongly Agree to Strongly Disagree on a 5 point scale, and it has three questions, such as:
Gays should be allowed to marry.
The U.S. should adopt universal health care.
Abortion should be outlawed.
If you ran a PCA on the raw scores from such a survey, you'd expect it to load heavily on one dimension, which could reasonably be interpreted as a liberal/conservative dimension. This is despite the fact that liberals would give 1s on the first two and 5 on the last. The different direction of the last question has zero effect on the strength of the associations.
When using the principal component as a measure, double-check that observations with low and high scores on the component are sensible and that your software hasn't flipped the scale from what you'd intuitively expect. You don't need to pre-transform your variables before running the PCA.
posted by shadow vector at 6:38 PM on August 18, 2008
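A small simulation along those lines, with entirely made-up responses, just to show the reverse-coded item picking up the opposite sign:

import numpy as np

# Simulate a latent liberal/conservative trait and three 1-5 Likert items,
# the third reverse-coded ("Abortion should be outlawed").
rng = np.random.default_rng(4)
n = 500
trait = rng.normal(size=n)

def likert(x):
    # Map a noisy continuous response onto a 1-5 agreement scale.
    return np.clip(np.round(3 + 1.2 * x + rng.normal(scale=0.7, size=n)), 1, 5)

item1 = likert(trait)    # gay marriage
item2 = likert(trait)    # universal health care
item3 = likert(-trait)   # abortion outlawed (reverse-coded)

X = np.column_stack([item1, item2, item3])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The first component loads strongly on all three items, but the third
# item's loading has the opposite sign from the other two.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
print(eigvecs[:, np.argmax(eigvals)].round(2))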
Response by poster: Right, I know. I didn't pre-transform and I understand that PCA doesn't care about sign. For instance, in your example above say that, for some reason, you have strong theoretical reasons for suspecting that liberals will also think that abortion SHOULD be outlawed (suspend your disbelief). Theoretically, you will then expect that most of these variables will load positively on this factor. However, when you run the PCA, you find that the third question loads negatively but with a high loading. When constructing your "liberal" scale, do you a) transform the third score so that higher scores indicate more liberalism even though it goes against your theory and previous research or b) stick with your theory and construct the scale as such?
posted by proj at 6:44 PM on August 18, 2008
I would stop and make every effort to understand why the counterintuitive result is happening before going any further. Any measure is itself a theory, and if it's one you don't understand or aren't comfortable with you shouldn't be using it.
posted by shadow vector at 6:58 PM on August 18, 2008
Well, PCA doesn't go off and find DEVELOPMENT for you if you feed it some measures you think are related to development. It's just a dumb algorithm and doesn't give a fuck whether the dimensions it finds make any substantive sense.
First I would check that I hadn't fucked up the coding. It's terrifyingly easy to forget that you've flipped the sign or order of a variable and then flip it back with a recode. Look at some cases and examine the values directly. The data-hygiene solution is to never, ever recode a variable onto itself, but always only onto a new variable, so you can check the construction at any point.
But I suspect that the answer is just that "development" isn't the best dimension among those four variables.
In the medium term, there are things you can do. I'm assuming that you're building an index because you're trying to create a dependent variable to model; otherwise just include all four variables if they're not too collinear.
One, you could just say "I built an index with principal components" and hope nobody asks how the variables load. When people see that what you call a development index penalizes per capita GDP or education or whatever, you are le fucked.
Two, just run multiple models with each of your four as DV. Are they reasonably consistent? Do prime theoretical movers switch signs when you go from a positive-loader to a negative-loader?
Three, use someone else's index of development. HDI or whatever. Just avoid the problem altogether.
For three of the variables, higher scores indicate more "desirable" values for societies on measures of development. The fourth variable is a macro-level variable that indicates the percentage of a society made up of people with a certain characteristic.
Please tell me you're not trying to combine dependent and explanatory variables into an index. If your theory is that more people with that characteristic --> more development, it belongs on the other side of the = from your development measures.
posted by ROU_Xenophobe at 7:31 PM on August 18, 2008
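A sketch of the recode-onto-a-new-variable point, with hypothetical column names; the raw column stays untouched so the construction can be audited later:

import pandas as pd

# Toy data; imagine "literacy_gap" is coded so that higher = worse.
df = pd.DataFrame({"literacy_gap": [0.10, 0.35, 0.60, 0.05]})

# Recode onto a NEW column, never onto the original, so the raw values
# remain available for checking the construction at any point.
df["literacy_rev"] = 1 - df["literacy_gap"]

# Spot-check cases side by side before using the recoded version.
print(df[["literacy_gap", "literacy_rev"]])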
Response by poster: No, I'm not trying to do that. I am, however, addressing some research that came close to doing that.
posted by proj at 7:32 PM on August 18, 2008
Response by poster: Oh, and the reason I am building an index is that they are highly collinear.
posted by proj at 7:34 PM on August 18, 2008
Best answer: Do you mean that you have 4 IVs that are collinear, so you're creating an index with PC and using that instead of all of them?
In that case, you must have some theory that development --> DV, or are testing someone else's theory that development --> DV. It must also be the case that your paper isn't about the differences between the different development measures, or you wouldn't want to mush them together into one index.
In which case my response would be to do one of two things after first noting that you have 4 collinear measures and can't include all of them:
First, just pick one of them and run with it. In the text or a footnote, say something like "This is just one obvious proxy for development. You might instead use blah, foo, or bar, like in these cites. I can't include all of them here because they are collinear (see correlation table, or their VIFs are above 20), but the results do not change if I include any of these other variables instead of the one I used."
Second, just run all four models and report them. There shouldn't be big differences between them. If there are, then the different models are tapping into different aspects of what we mean by development, and that's interesting too. Also, if there are differences between the models, then that should mean that they're not simply collinear, and that you might be able to include more than one at a time, even if not all four. I mean, if there were a really bad collinearity problem, you'd expect the other coefficients to remain about the same (or the same but sign-flipped) as you switch out the development variables.
If I were reviewing and saw a principal-components-created index in this circumstance, I might well think "Hmm. The simple thing to do would be to just run multiple models, but they didn't do that. I bet what's going on is that none of the variables are significant on their own, but the index is. I'd better make sure that the index is measuring development and nothing else, because that means that everything is riding on the index. Also, I don't appreciate people trying tricks like that."
I mean, I know that using principal components is an accepted way to deal with collinearity. But it seems to be from the "model fit statistics are God" school of thought that's, well, wrong in social science.
posted by ROU_Xenophobe at 9:20 PM on August 18, 2008
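A sketch of the VIF check mentioned in the footnote suggestion, on synthetic data with hypothetical variable names (statsmodels provides variance_inflation_factor):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for four deliberately collinear development measures.
rng = np.random.default_rng(5)
base = rng.normal(size=60)
devel = pd.DataFrame({
    "gdp_pc":    base + rng.normal(scale=0.3, size=60),
    "education": base + rng.normal(scale=0.3, size=60),
    "life_exp":  base + rng.normal(scale=0.3, size=60),
    "urban_pct": base + rng.normal(scale=0.3, size=60),
})

# VIFs well above 10-20 signal the measures can't usefully be entered together.
X = sm.add_constant(devel)
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 1))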
More to the point, if you're not a scholar of development in particular, then I would not try to create a new index of development. Use one of theirs, or just use normal variables.
posted by ROU_Xenophobe at 9:22 PM on August 18, 2008
This thread is closed to new comments.