Ref.Cat? s.e.? Stats help needed!
January 26, 2012 5:12 AM
Would you be kind to someone who is culturally proficient, but statistically deficient? Specifically, what is going on in the tables at the end of this paper?
Hopefully this isn't too do-my-homework-y for Ask, but I got sent this paper, which seems to be saying some interesting things about who gets involved with the arts in Britain, but when it comes to tables 3 and 4, I feel like I'm having my face rubbed in my own decrepitude. If anyone can explain this as though to a dullard, that'd be great.
It's a multinomial logit (it googles). Since there are more than 2 outcomes (O vs U vs OL in music) the analyst picks one as the baseline, and computes betas for the other two groups (which represent changes in the probability of being in that group on a non-linear scale) for all the predictors with the constraint that the probabilities for being in the 3 groups have to add up to one. Picking the baseline outcome group is arbitrary, and the joint betas are hard to interpret because of the constraint, so one way to present the results is like in table 4 where you vary the baseline group. Alternatively you can present them all at the same time, or present a chained regression ( OL vs not OL, then given that someone is not OL O vs U) - I think that's what they're doing in Table 3.
The reference category business is because many of the predictors are also categories. In regression there's always an "intercept" or constant term which asks "what's the probability of being in group A given that all the predictors are zero". However, for numerical predictors you can arbitrarily decide what "zero" means (like degrees centigrade vs Fahrenheit), and the same thing is true of categories. For example, the analyst decided that the "intercept" would reflect people who are single, and so reports a coefficient for how different those who are married are.
There's an alternative constraint that some people use where instead of forcing the intercept to be one level of a predictor it is made to correspond to the average - even if that's not a real level (like being half-single). Then you report a coefficient for being single vs average and married vs average.
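To make the two conventions concrete, here is a minimal sketch using made-up log-odds for two marital-status groups. The numbers and the function names are hypothetical illustrations, not values or methods from the paper.

```python
# Hypothetical log-odds of being in group A for each marital status.
# These values are invented for illustration; they are not from the paper.
group_log_odds = {"single": -2.0, "married": -1.5}

def reference_coding(log_odds, ref):
    """Intercept is the reference level; other levels are reported relative to it."""
    intercept = log_odds[ref]
    coefs = {k: v - intercept for k, v in log_odds.items() if k != ref}
    return intercept, coefs

def effect_coding(log_odds):
    """Intercept is the unweighted average of the levels; every level gets a deviation."""
    intercept = sum(log_odds.values()) / len(log_odds)
    coefs = {k: v - intercept for k, v in log_odds.items()}
    return intercept, coefs

ref_intercept, ref_coefs = reference_coding(group_log_odds, ref="single")
eff_intercept, eff_coefs = effect_coding(group_log_odds)
print(ref_intercept, ref_coefs)  # intercept is "single"; one coefficient, for "married"
print(eff_intercept, eff_coefs)  # intercept is the average; a coefficient for each level
```

Either way, intercept plus coefficient gives back the same log-odds for each group; only the bookkeeping differs.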
posted by a robot made out of meat at 6:31 AM on January 26, 2012
themel has it. Forgive me if this is too obvious, but I'll only add that betas with an asterisk or two indicate "statistical significance". That is, if there were no real effect, there would be a low enough* probability of seeing an estimate this large by chance alone, so it's taken as a "real" effect.
* I don't know why 5% is the threshold for being a real effect. I'm sure it's historical, but careers have been made and lost because of this number. It essentially means that at this beta value cutoff, random noise would show significance 5% of the time. (For two asterisks: 1% of the time.) This is important to keep in mind when looking at a giant table of values (or in a better example from my own research, a head full of EEG electrodes): if only 5% of them reach p < .05 significance, you've found nothing at all. (It's more complicated than this, but that's the gist.) This is called type 1 error.
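A rough sketch of the giant-table problem, assuming the tests are independent and that no effect is real. The count of 100 tests is an arbitrary illustrative figure, not something from the paper.

```python
# Number of independent tests in a large table; 100 is an arbitrary illustrative figure.
n_tests = 100
alpha = 0.05

# If no effect is real, each test still has probability alpha of a false positive.
expected_false_positives = n_tests * alpha
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one:   {prob_at_least_one:.3f}")
```

So with 100 null tests you would expect about five starred values from noise alone, and it would be surprising to see none.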
posted by supercres at 7:24 AM on January 26, 2012
This is an example of people who do things with data having no damn clue how to communicate it. That's what's going on (and it drives me nuts!).
posted by entropone at 7:34 AM on January 26, 2012
I am a logit modeler, but I am not your logit modeller, and the following does not constitute logit modelling advice.
Actually, I'm not a statistician at all, but I can't sleep, so I skimmed the paper expecting to know nothing and was pleasantly shocked to the point I sat up in bed and said "Oh hey! Multinomial logit models!" As a side note, this paper is really jargony and not very clear; my guess is it's a distillation of a longer work that sets more of this out in detail.
Okay, so let's walk through table 3. It's a multinomial logit model, which is basically an analytical technique that represents the choice between two or more distinct alternatives as a mathematical function. There are actually three models here in this table; the first one, which is described in the first two columns, is the one we'll focus on.
The basic idea is this: People are making a choice between two alternatives for their consumption of Theatre, Dance and Cinema (TDC): they could be Univores (U), a group of people who watch movies, but don't do any of the other stuff, or they could be Omnivores (O), people who watch movies, go to live theatre, and attend dance productions. This choice comes from the latent class analysis they did to produce table 1; basically what they tried to do is simplify the wide range of choices one could make in terms of cultural activities by classifying everybody into these two groups so that the groups were the most similar internally and the least similar to each other, because trying to analyze a million choices of "Mostly goes to movies, but does attend the pantomime at Christmas" and "Ballet lover" and "Watches movies, but mostly Jason Statham movies, and goes to musicals, but mostly ones like the Queen and the Abba ones" and whatever would be insanity. So we have people choosing, in the TDC domain, to be either Omnivores or Univores.
The point of using the MNL form is to simultaneously consider several different aspects of the choice. They could do much simpler analysis and find that professionals are higher consumers, that people with higher education are higher consumers and that people with higher income are higher consumers. But most professionals have advanced education and earn high wages, so is this one group of rich doctors having their cell phones go off at the theatre, or is one of the factors stronger than another? So basically, the MNL model says that the value people place on each choice is a function of a number of different attributes about the people and/or the choice. (In this case, it's all about the people choosing, but MNL is common in marketing, where the choice is between products of different attributes, and in transport modelling, where one choice is between modes of travel with different attributes. I come from the latter group of nerds, by the way.) The MNL model also says a number of other things, like that the alternatives are all independent of each other (unimportant here) and that the "error term", which is the part of the choice that the model can't explain, follows a Gumbel (type I extreme value) distribution, which older transport literature sometimes calls Weibull. It looks a lot like a normal distribution, which is a pretty good assumption and quickly gets confusing to wade into.
So what the model does can be thought of as giving a score to each of the choices, where this list of parameters (the betas) represents the elements of the score. The way this is being presented, the score is entirely for the Omnivore alternative, and the score for the Univore alternative is 0. Once these scores have been calculated, you could find the probability of someone being an omnivore; it's e^(score for O) / ( e^(score for O) + e^(score for U) ). As the score for O goes up, people are more likely to choose it. (For future reference, if there are more than two alternatives, then the probability is always e^alternative you're interested in / sum of e^all alternatives.) If you wanted, you could set up a spreadsheet and actually look at the influence of the parameters, or you could just look at their magnitude, significance and sign to gather more general conclusions, which is what the authors of this paper did.
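If a spreadsheet isn't handy, the score-to-probability step can be sketched in a few lines. The scores below are arbitrary illustrative values, and the function name is made up for this example.

```python
import math

def choice_probabilities(scores):
    """Turn a dict of scores (utilities) into choice probabilities."""
    exps = {alt: math.exp(s) for alt, s in scores.items()}
    total = sum(exps.values())
    return {alt: e / total for alt, e in exps.items()}

# Two alternatives, with the Univore score fixed at 0 as the baseline.
# The Omnivore score of -1.0 is an arbitrary illustrative value.
two_way = choice_probabilities({"O": -1.0, "U": 0.0})

# Three alternatives work the same way; these scores are also made up.
three_way = choice_probabilities({"O": -1.0, "U": 0.0, "OL": 0.5})

print(two_way)
print(three_way)
```

The probabilities always add up to one, and raising one alternative's score raises its share at the expense of the others.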
To jump to the bottom of the table, there's a constant associated with each of these models; in the TDC case, it's -2.118. So for a person with none of the other aspects of the model above, it means that they have an e^-2.118 / (e^-2.118 + e^0) = 0.107 = 10.7% chance of being an omnivore. This is a large, negative utility (another term for score), so in general, people are much more likely to be univores. Furthermore, we know the standard error for this parameter, which is essentially how certain we are of its value. This is described in the s.e. column, and it's 0.292. From this, we can calculate the t-statistic with respect to 0 by dividing the parameter by the standard error, so the t-statistic is -2.118 / 0.292 = -7.25, which is a high degree of significance. An absolute value above 1.96 (the sign doesn't matter here) means the parameter is significantly different from zero at the 95% level. To avoid us calculating the t-statistic for each parameter, they've helpfully put one asterisk beside those where the p-value (the probability of seeing an estimate at least this far from zero if the true parameter were 0) is 0.05 or less, and two where it is 0.01 or less. The point being that they've come up with a value for each of the parameters, but many of them aren't statistically significant; they could indicate something, but they could just as easily be noise.
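The arithmetic for the constant can be checked with a short sketch. The constant and standard error are the ones quoted above; the p-value line uses a normal approximation, which is my assumption rather than something the paper states.

```python
import math

constant = -2.118   # constant for the TDC model, as quoted above
std_error = 0.292   # its standard error, from the s.e. column

# Probability of being an omnivore when every other term in the score is zero.
p_omnivore = math.exp(constant) / (math.exp(constant) + math.exp(0))

# Test statistic against zero, and a two-sided p-value under a normal approximation.
t_stat = constant / std_error
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"P(omnivore) = {p_omnivore:.3f}")  # about 0.107
print(f"t = {t_stat:.2f}")                # about -7.25
print(f"p-value = {p_value:.2e}")
print(f"significant at the 5% level: {abs(t_stat) > 1.96}")
```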
So one thing about estimating a model of this type is that (for reasons that are way over my head but I think involve linear algebra) you can't calculate the score if you try to come up with parameters for everybody. Basically, because we're looking at relative preferences, there needs to be a base we're referring to. For something like age or income, there is an implicit base, 0 - even if the data doesn't have people with 0 age, the ages are all relative to that 0 point. However, for something like gender, you can't have a value for both men and women, because (notwithstanding a complex digression on society and gender roles) everybody's in one of those two camps. So instead, they created a binary variable, where Male is 0, and Female is 1. That's the Ref.Cat., or reference category, they're talking about. It's clear here, but may not be elsewhere, so they spell it out in all cases, which is pretty much the only particularly helpful or clear thing they've done in the whole paper. So in this case, considering only the constant and the female parameter, women have a score of (0.615 * 1) + -2.118, while men have a score of (0.615 * 0) + -2.118.
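Here is a small sketch of what those two scores mean as probabilities, using only the constant and the female parameter quoted above and holding everything else at zero. The function name is made up for the example.

```python
import math

constant = -2.118     # TDC constant, as quoted above
beta_female = 0.615   # coefficient for Female (reference category: Male)

def p_omnivore(female):
    """Probability of being an omnivore, using only the constant and the female term."""
    score = constant + beta_female * female
    return math.exp(score) / (math.exp(score) + 1.0)  # Univore score is 0, and e^0 = 1

print(f"men:   {p_omnivore(0):.3f}")
print(f"women: {p_omnivore(1):.3f}")
```

That works out to roughly 11% for men against roughly 18% for women, before any of the other parameters are added in.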
Right, so we're ready to actually look at the parameters and draw conclusions. What we see is that women, relative to men, are strongly more likely to be omnivores, because the ballet is for chicks. Well, maybe not, but it's a clear and strong distinction here that the other two models in table 3 don't share. People who are married or separated (as opposed to single) may be more likely to be cultural omnivores, but that's not an inference the data strongly supports. Age is totally irrelevant, although that may be due to the model form; what it really says is there is no single rule leading to additional consumption as you get older -- my guess is that the age effect is more complicated, because the very elderly, who have the highest age, generally don't get out much. The next set of parameters refers to the presence of a child in the household, with childless people being the reference category. So there's a strong reduction in consumption of TDC associated with people who have a very young child, with essentially no difference between people with older children and people with no child. That's interesting.
Quickly skimming through the rest of these parameters, we see that the part of England has some, but not a significant amount, of effect (the North is very nearly significant at the 95% level). Income is strongly correlated with more consumption, although they frustratingly don't give an actual scale -- my guess would be thousands of pounds, which makes that parameter pretty big once you multiply it out; a 50K pound income would add 0.026*50=1.3 to the score for omnivores. The effect for higher education is both large and strongly significant as well, while the effect of class is much smaller, except for routine workers. The status value is large and significant, although not as big as the education. (I don't know what the scale for status is, so it's hard to say exactly what this means relative to some of the other aspects - is it 0 to 1? 0 to 1000?) In theory, you could make comparisons between the parameters and their magnitudes at this point -- getting an O-level has a broadly similar effect in the consumption of culture as moving from the routine work class to the high professional class, or having your child go from being 0-4 to older, or having your income increase by 25K, or trading your penis in for ovaries, or one "Status", whatever that means.
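A quick sketch of that kind of side-by-side comparison, using the income and female coefficients quoted in this answer. The income unit (thousands of pounds) is only my guess from above, so treat the income figures as conditional on that assumption.

```python
import math

beta_income = 0.026   # as quoted above; the paper does not give the units
beta_female = 0.615   # as quoted above

# Assumption (a guess, not from the paper): income is in thousands of pounds.
contribution_50k = beta_income * 50
contribution_25k = beta_income * 25

print(f"50K income adds {contribution_50k:.2f} to the omnivore score")
print(f"25K income adds {contribution_25k:.2f}, versus {beta_female} for being female")

# exp(beta) is the multiplicative change in the odds of O versus U.
odds_ratio_female = math.exp(beta_female)
print(f"odds ratio, female vs male: {odds_ratio_female:.2f}")
```

Putting everything on the score scale like this is what makes "25K of income is about the same as being female" a meaningful statement.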
The important thing is that all of these aspects are being considered together; that this model controls for the fact that some people live in London, where, I assume, there is more TDC to enjoy, but which would be correlated with higher incomes and more professional employment. And it tries to explicitly tease apart the effects of class vs. status vs. income vs. education, which is pretty cool. Education seems to be the most important, especially in the other two models in the table.
Just as an aside, this is fascinatingly British from my North American perspective; we don't have class as a statistical category here, we use occupation which mostly corresponds, but not entirely, and isn't described in such explicit "higher" and "lower" terms. And that status thing just blew my puny lizard brain; I'd never think to look at something like that, and wouldn't even know where to start.
Hopefully this helps -- thanks for pointing me to a really interesting study that is so far removed from the literature I read that I'd never see it!
posted by Homeboy Trouble at 7:38 AM on January 26, 2012 [2 favorites]
Response by poster: Sweet lord above, mefites, I love you, your ancestors, your progeny and all the people you wish the best for. I always think it's hokey when people mark all the answers as the best answer, so I am instead leaving this comment to say that all these answers are the best answer, and some of them are even better than that. Superduperthanks all round, you bunch of swells!
posted by robself at 3:30 PM on January 26, 2012
The betas are regression coefficients, indicating how strongly a certain independent variable influences the dependent variable. If its magnitude is large, the influence is strong, and if the sign is negative, having the independent variable actually makes the outcome less likely (e.g. people from "Class 7" are less likely to be musical omnivores than people in the reference class since they have a negative beta).
s.e. is the standard error, indicating how good the estimate of beta is assumed to be given the input data.
Ref.cat. indicates reference categories, meaning e.g. that a number is not from comparing "The North" to "Not the North", but rather from a comparison against a different subsample.
posted by themel at 5:57 AM on January 26, 2012