Alternatives to a t-test?
July 21, 2011 8:55 AM

If I have a dependent categorical variable with two values (say, basketball fans vs non-fans) and an independent numerical variable (say, height in millimeters) what statistical tests could I use to analyze my data, besides a t-test?

I'm hoping to be able to do more than just say that there is/isn't a significant difference between basketball fans vs non-fans' mean heights. Ideally there's a way for me to make some claim about the influence that height has on the likelihood of being a fan.

(It's been a while since I've taken a stats class, so I've no idea if there's a lot of answers to this question or whether I'm stuck. If there's no way for me to make predictions about fandom based on height, what kind of data would I need in order to analyze in that way?)
posted by danceswithanonymity to Science & Nature (13 answers total)
I think a logistic regression will give you what you are looking for (namely, the probability that a person of a given height will be a basketball fan).
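For illustration, here is a minimal sketch of what a one-predictor logistic regression does under the hood, in plain Python with no stats package. The heights and fan labels are made up, and `fit_logistic` and `fan_probability` are hypothetical helper names, not any library's API. A real stats package would do the fitting for you and also report standard errors.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient with respect to the intercept
            g1 += (y - p) * x    # gradient with respect to the slope
            h00 += w             # entries of the negated Hessian
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system H * delta = g.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up sample: heights in millimeters, 1 = fan, 0 = non-fan.
heights_mm = [1600, 1650, 1680, 1700, 1720, 1750, 1780, 1800, 1850, 1900]
is_fan = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Center and rescale heights so the Newton steps are numerically well behaved.
center = sum(heights_mm) / len(heights_mm)
scale = 100.0
xs = [(h - center) / scale for h in heights_mm]
b0, b1 = fit_logistic(xs, is_fan)

def fan_probability(height_mm):
    """Fitted probability that a person of this height is a fan."""
    z = b0 + b1 * (height_mm - center) / scale
    return 1.0 / (1.0 + math.exp(-z))

for h in (1600, 1750, 1900):
    print(h, round(fan_probability(h), 3))
```

The fitted probabilities describe an association in this sample only; they say nothing about why taller people might be fans.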
posted by pemberkins at 9:30 AM on July 21, 2011


The influence of height on being a basketball fan? I think the best you could do would be a t-test where you would use fan vs not fan as the independent (grouping) variable and height as the dependent variable and just say whether the mean heights differ. Logistic regression would give you the same result, unless you have other predictors to throw into the equation (e.g., gender, ethnicity). You certainly couldn't make a causal statement with only those data points. Being a fan is probably influenced by a whole lot of factors (e.g., family of origin, peer influences, gender roles) and you really can't control for those unless you have more data.

Do you imagine there's going to be a linear relationship, such as people who are 6'4" will be more likely to be fans than people who are 6'1"? If you have some kind of idea like that, you could break the sample into chunks based on height ranges and see the proportion of people who are fans in each group. If you found that only 20% of people 5' to 6' are fans versus 50% of people 6' and over, that would be suggestive, though the power of this finding would be highly influenced by other factors (e.g., sample size, number of people in each group, gender composition of groups, socio-economic factors) that you may or may not have control over.
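The chunking idea can be sketched in a few lines of Python. The data and the 100 mm band width are assumptions for illustration, and `fan_share_by_height_band` is a hypothetical helper name.

```python
from collections import defaultdict

def fan_share_by_height_band(heights_mm, is_fan, band_width_mm=100):
    """Map each height band (by its lower edge) to (share of fans, group size)."""
    counts = defaultdict(lambda: [0, 0])  # band start -> [fans, total]
    for height, fan in zip(heights_mm, is_fan):
        band_start = (height // band_width_mm) * band_width_mm
        counts[band_start][0] += fan
        counts[band_start][1] += 1
    return {band: (fans / total, total)
            for band, (fans, total) in sorted(counts.items())}

# Made-up sample: heights in millimeters, 1 = fan, 0 = non-fan.
heights_mm = [1580, 1620, 1660, 1690, 1710, 1740, 1790, 1820, 1860, 1910]
is_fan = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

shares = fan_share_by_height_band(heights_mm, is_fan)
for band, (share, size) in shares.items():
    print(f"{band}-{band + 99} mm: {share:.0%} fans (n={size})")
```

Reporting the group size next to each share matters, because a band with one or two people in it tells you very little.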

For these kinds of data to suggest anything other than a correlation (which is not equal to causation), you need a lot of other datapoints that would allow you to rule out other plausible influences that might lead people to become basketball fans.
posted by jasper411 at 9:32 AM on July 21, 2011


If you're passingly familiar with regression, use logit or probit. There are occasional theoretical reasons to prefer one over the other, but I can't recall what they are... in any case, they will almost always provide you with the same inferences.

If you vaguely recall regression or remember hearing about OLS, I'd do that first. Dunno what package you're using to analyze, but it shouldn't be Excel. Whatever it is, it will spit out coefficients for you, including HEIGHT and CONSTANT or INTERCEPT. These define a line where PROBABILITY OF BEING A FAN = CONSTANT + HEIGHT*(actual height). That is, in the OLS model, each millimeter increases the probability of being a fan by the coefficient HEIGHT.

Let's say that the coefficient is positive (taller people are more likely to be fans). A problem is that you're virtually certain to predict that people over some tall height have a 115% chance of being a fan, while very short people have a -9% chance. Which is flatly impossible.
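A small sketch of that problem, using made-up data and a hand-rolled least-squares fit (`fit_ols` and `ols_probability` are hypothetical names). With this sample the fitted line leaves the 0 to 1 range once you move far enough from the middle of the data.

```python
def fit_ols(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up sample: heights in millimeters, 1 = fan, 0 = non-fan.
heights_mm = [1600, 1650, 1680, 1700, 1720, 1750, 1780, 1800, 1850, 1900]
is_fan = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

constant, height_coef = fit_ols(heights_mm, is_fan)

def ols_probability(height_mm):
    """PROBABILITY OF BEING A FAN = CONSTANT + HEIGHT * (actual height)."""
    return constant + height_coef * height_mm

for h in (1500, 1750, 2100):
    print(h, round(ols_probability(h), 3))
```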

Logit and probit get around this problem and so are favored for binary dependent variables. The problem is that you can't directly interpret the coefficients as probability statements -- instead there's an intermediate step. Documentation should be able to walk you through that step.
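For the logit case, that intermediate step is the inverse-logit transform. A minimal sketch follows; the two coefficients are invented for illustration and are not estimates from any data.

```python
import math

# Hypothetical logit coefficients (assumed, not estimated from anything).
CONSTANT = -17.0
HEIGHT = 0.01  # change in log-odds per millimeter

def logit_probability(height_mm):
    """Turn the linear predictor (log-odds) into a probability."""
    log_odds = CONSTANT + HEIGHT * height_mm
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(height_mm):
    p = logit_probability(height_mm)
    return p / (1.0 - p)

# The same 100 mm step changes the probability by a different amount each time...
for h in (1700, 1800, 1900, 2000):
    print(h, round(logit_probability(h), 3))

# ...but it always multiplies the odds by the same factor, exp(HEIGHT * 100).
print(round(odds(1800) / odds(1700), 3), round(math.exp(HEIGHT * 100), 3))
```

This is why logit coefficients are usually reported as odds ratios or turned into predicted probabilities at a few chosen values, rather than read directly as changes in probability.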
posted by ROU_Xenophobe at 9:34 AM on July 21, 2011


(More on the ins and outs of logistic regression. And I should clarify that this test only gives you probabilities, and oughtn't be interpreted as saying anything about the "influence" or causality of height on fan-hood).
posted by pemberkins at 9:35 AM on July 21, 2011


Logistic regression would give you the same result, unless you have other predictors to throw into the equation (e.g., gender, ethnicity).

Not quite.

Both the t-test and a logit model (= logistic regression) will tell you directly whether there's a difference in heights between fans and nonfans.

But the logit will also tell you how sharp that effect is, and (with some work) what the uncertainty around that effect is.

As in a lot of cases, t-tests (and similar things) screen, but regression models describe.
posted by ROU_Xenophobe at 9:37 AM on July 21, 2011


Also, are you actually modeling being a fan as a function of height, or was that just an example?

Because if that was just an example, there may be techniques out there that are intended to model the precise situation that you're dealing with and that take into account (or at least make reasonable assumptions about) other aspects of that situation.
posted by ROU_Xenophobe at 9:41 AM on July 21, 2011


Here you go...
logistic regression (but you'll need another independent variable)
discriminant analysis
posted by k8t at 9:44 AM on July 21, 2011


Remember that logistic regression comes with an assumption of linearity for the effect of height (on a very unusual scale to most people). Well kind of, it does something meaningful regardless of the shape, though you often want to look for non-linear relationships as well. That approach can be complex or simple depending on how far you want to push it.

If you only have one quantitative variable, it's just as useful to plot the histograms of height for each group (or a qqplot) and test / describe it that direction. The exception is if you have the rare situation in which causality can be determined from the data, and the direction is known.
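Describing the data in that direction (height given group) could look like the sketch below. The samples are made up, the 1840 mm cutoff is arbitrary, and the interval uses a normal approximation, which is an assumption; with groups this small a t-based interval would be somewhat wider.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_with_ci(sample, confidence=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    m = mean(sample)
    standard_error = stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m, m - z * standard_error, m + z * standard_error

def share_taller_than(sample, cutoff_mm):
    """Fraction of the sample strictly taller than the cutoff."""
    return sum(1 for h in sample if h > cutoff_mm) / len(sample)

# Made-up heights in millimeters for each group.
fan_heights = [1680, 1750, 1800, 1850, 1900]
non_fan_heights = [1600, 1650, 1700, 1720, 1780]

for label, sample in (("fans", fan_heights), ("non-fans", non_fan_heights)):
    m, lower, upper = mean_with_ci(sample)
    tall = share_taller_than(sample, 1840)
    print(f"{label}: mean {m:.0f} mm (95% CI {lower:.0f}-{upper:.0f}), "
          f"{tall:.0%} taller than 1840 mm")
```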
posted by a robot made out of meat at 9:49 AM on July 21, 2011


and oughtn't be interpreted as saying anything about the "influence" or causality of height on fan-hood).

Right, yes - I picked "influence" because I was trying to avoid implying causality and I don't know what the lingo is. Whoops?

Also, are you actually modeling being a fan as a function of height, or was that just an example?

Just an example, with the salient factors (dependence, type and # of variables) the same.

logistic regression (but you'll need another independent variable)

Hmm. I have one additional independent variable I can use, but it's categorical, not continuous. I'd also be highly surprised if that additional variable wasn't much more highly correlated with the dependent variable than the variable I'm interested in. (I guess a good analogy would be gender, apologies to all my fellow lady ballers.)

If you only have one quantitative variable, it's just as useful to plot the histograms of height for each group (or a qqplot) and test / describe it that direction.

I know what a histogram is but I'm not sure what you mean by "describe it in that direction".
posted by danceswithanonymity at 10:02 AM on July 21, 2011


logistic regression (but you'll need another independent variable)

This is wrong.

I have one additional independent variable I can use, but it's categorical, not continuous.

You can use categorical variables as regression predictors. It is typical to "dummy" them into a set of binary (0/1) variables and leave one baseline category out.
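A minimal sketch of that dummy coding in plain Python. The `region` variable and its levels are invented for illustration, and `dummy_code` is a hypothetical helper; most stats packages will do this step for you.

```python
def dummy_code(values, baseline):
    """Return (levels, rows): one 0/1 column per non-baseline level."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

# Made-up categorical predictor with three levels.
region = ["north", "south", "west", "south", "north"]

levels, rows = dummy_code(region, baseline="north")
print(levels)
for value, row in zip(region, rows):
    print(value, row)
```

The baseline category is the one coded as all zeros, so each dummy coefficient is read as a difference from that baseline.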

I know what a histogram is but I'm not sure what you mean by "describe it in that direction".

"The mean height of basketball fans was a (95% CI =c-d) and of non-fans r (95%CI=s-t). In figure 1 we have plotted the histogram of heights for fans (red) and non-fans (black). As you can see, the biggest difference is in the probability of being extremely tall (h>2 m) f% in fan, g% in non-fans."
posted by a robot made out of meat at 10:41 AM on July 21, 2011


Just an example, with the salient factors (dependence, type and # of variables) the same.

Then what is it really?

I only ask because any model is going to compare your results to a null hypothesis, or, really, a null data-generating process. That is, it's going to ask "How unlikely would it be to obtain the results that we see here if nothing interesting were going on?"
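One concrete way to see what a null data-generating process means is a permutation test, sketched here with made-up heights. The null in this sketch is that group labels are unrelated to height, so shuffling the labels should produce differences as large as the observed one fairly often. `permutation_p_value` is a hypothetical helper name, and this is one possible null among many.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Share of label shufflings whose mean difference is at least as large
    (in absolute value) as the observed one."""
    rng = random.Random(seed)
    size_a = len(group_a)

    def mean_gap(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_gap(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean_gap(pooled[:size_a], pooled[size_a:]) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up heights in millimeters for each group.
fan_heights = [1680, 1750, 1800, 1850, 1900]
non_fan_heights = [1600, 1650, 1700, 1720, 1780]

print(permutation_p_value(fan_heights, non_fan_heights))
```

If the real data have structure that simple label shuffling ignores, this null is the wrong one, which is the point being made above.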

But some statements of "nothing interesting is going on" might be inappropriate for some kinds of data. Not so much because the data are binary or whatever (though that's true too), but more because of the underlying nature of the data. ISTR an old example where an otherwise boringly ordinary null data-generating process wasn't necessarily appropriate for analyzing DNA, because DNA doesn't randomly vary in the way specified by that null process.

Anyway, the thing to do really is consult the literature that analyzes whatever your DV is and see what those people have done.
posted by ROU_Xenophobe at 11:09 AM on July 21, 2011


Then what is it really?

I'm attempting to look at the relationship between campaign contributions and a specific vote in a specific legislature. You're right that there's probably a literature out there for how to do this kind of thing, though.
posted by danceswithanonymity at 11:34 AM on July 21, 2011


Okay...

So for the raw relationship (people receiving $X had a Y% probability of voting yea), logit or probit will be fine in the sense that they are commonly used to analyze single votes. Which one doesn't really matter, but logit is marginally easier to deal with.

I assume your other variable is party. You need that too. It would probably make sense to do this either with an interaction between party and contributions, or run separate models for Democrats and Republicans in addition to the unified model.
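To make the interaction idea concrete, here is a sketch of the design row and the resulting logit probability. The coefficients are invented for illustration, not estimated from any legislature, and the variable names (`contribution_k` for contributions in thousands of dollars, `is_republican` as a 0/1 party dummy) are assumptions.

```python
import math

def design_row(contribution_k, is_republican):
    """Columns: intercept, contributions, party dummy, party x contributions."""
    return [1.0, contribution_k, float(is_republican),
            contribution_k * is_republican]

def yea_probability(row, coefs):
    """Inverse-logit of the linear predictor."""
    z = sum(b * x for b, x in zip(coefs, row))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: intercept, contributions, party, interaction.
hypothetical_coefs = [-1.0, 0.10, 0.5, -0.08]

# With an interaction, the log-odds slope on contributions differs by party.
slope_democrat = hypothetical_coefs[1]
slope_republican = hypothetical_coefs[1] + hypothetical_coefs[3]
print(slope_democrat, round(slope_republican, 2))

for party in (0, 1):
    for amount in (0, 10, 20):
        p = yea_probability(design_row(amount, party), hypothetical_coefs)
        print(party, amount, round(p, 3))
```

Running separate models per party is roughly equivalent to interacting party with every term, including the intercept.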

But.

That's only if really all you want is to learn that people receiving $X had a Y% probability of voting yea.

If you want to try to understand the effect that contributions had on votes, you've got a long and difficult road ahead of you because of -- and you need to imagine this in 300 point letters, on fire, while Samuel L Jackson intones it in a deep booming voice that shatters stone --

ENDOGENEITY*.

The problem is that with a simple logit or probit you can't tell -- at all -- the difference between "I gave this candidate money because I like his or her positions" and "I voted this way because people gave me money to." I mean, really, not even in the slightest. Trying to suss this out is ugly ugly ugly stuff involving instrumental variables and the like, or trying to find natural experiments. Or, alternately, trying to find additional, non-vote evidence that supports one causal direction or the other. Or, with some more data, maybe a matching model.

What the right thing to do here is will depend a lot on why you're doing it. What's right for an undergraduate paper is not the same as what's right for a newspaper article or think-tank report, which is not the same as what's right for a conference presentation or seminar paper in grad school.

*Presumably Samuel L would actually say something like "Endogeneity, motherfucker," or "Endogeneity -- do you sass** it?"

**OF COURSE he routinely quotes little bits from Hitchhiker's Guide.

posted by ROU_Xenophobe at 12:42 PM on July 21, 2011

