# Can you give me stats direction? Yes/No?

November 6, 2011 3:06 PM

(StatisticsFilter) I'd like to compare two binomial variables from the same population...

... It seems like it should be easier than I'm making it. Setup: 1 population of samples (about 40), each of which can be negative or positive (0-1) for two variables. I'd like to determine whether the distributions of the two binomial variables are statistically significant.

Imagine a 40-row-by-3-column table, filled with 0s and 1s.

These aren't means, so I can't use a chi-square test (or really any other row-column test I've seen). Help?

A couple clarification questions:

A distribution itself can't be "statistically significant" - what about the distributions are you trying to measure? Are you trying to figure out if the two variables have means which are different by an amount that is statistically significant? Do you want to know if the two variables are correlated?

Why does your imaginary table have three columns? You only mentioned two variables.

posted by Salvor Hardin at 3:13 PM on November 6, 2011

If I'm understanding you right, you've got variables x1 and x2 that can take values 0 and 1, for the same 40 samples, and you want to know if there's a difference between the two variables. Instead of a 40x2 table (the unique identifier column doesn't count in describing tables that way), what you want is a 2x2 frequency table crosstabulating the 0s and 1s of x1 against those of x2. You can then use a chi-squared test to compare the observed frequencies with the expected ones.
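As a rough sketch of that reshaping step, here is one way to collapse a 40x2 table of 0s and 1s into a 2x2 table of counts in Python. The 40 rows below are invented purely for illustration; only the names x1 and x2 come from the description above.

```python
from collections import Counter

# Hypothetical data: 40 samples, each an (x1, x2) pair of 0/1 values.
rows = [(1, 1)] * 12 + [(1, 0)] * 6 + [(0, 1)] * 8 + [(0, 0)] * 14

# Count how often each (x1, x2) combination occurs.
counts = Counter(rows)

# 2x2 crosstab: rows are x1 = 1, 0; columns are x2 = 1, 0.
table = [
    [counts[(1, 1)], counts[(1, 0)]],
    [counts[(0, 1)], counts[(0, 0)]],
]

for label, row in zip(("x1=1", "x1=0"), table):
    print(label, row)
```

The four cell counts always add up to the number of samples, which is a quick sanity check on the reshaping.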

posted by gingerest at 3:26 PM on November 6, 2011

So, hypothetically...

You have 40 people.

You have a variable blue eye where they either get 0 (no) or 1 (yes)

You have a variable friendly where they either get 0 (no) or 1 (yes)

You say that you want to test the statistical significance of the distributions (but as SH said, that isn't possible).

Do you mean that you want to know whether the combination blue eye + friendly occurs more often than the combinations not blue eye + not friendly or blue eye + not friendly?

posted by k8t at 3:28 PM on November 6, 2011

IANAS, but I think you are trying to estimate the likelihood that the same distribution (functional form and parameters) generated the 2 samples that you have. That won't work if the variables are different (which is why I think you have 3 variables instead of 2).

If that is not the case, then a simple test of proportions should tell you if the samples are generated by the same underlying distribution.

posted by rasputin98 at 3:32 PM on November 6, 2011

Sorry, all, for my hazy/inaccurate description. So, yes, a 40x2 table. And, yes, thank you for correcting my terminology. What I meant was, I'm trying to determine if the distribution of the variables is correlated (rather than each column's total proportion). Or - how k8t described it with the blue eye + friendly example.

posted by slab_lizard at 5:16 PM on November 6, 2011

*What I meant was, I'm trying to determine if the distribution of the variables is correlated*

Confused. Is there a reason you're not doing a correlation, then?

*These aren't means, so I can't use a chi-square test*

Sure they are. You have the mean of X1 when X2 is zero, and the mean of X1 when X2 is one.
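To make both points concrete, here is a small sketch, assuming made-up data for 40 samples, that computes the mean of X1 within each level of X2 and the ordinary Pearson correlation between the two columns (for two 0/1 variables this is often called the phi coefficient). The variable names are illustrative only.

```python
from math import sqrt

# Hypothetical data: 40 (x1, x2) pairs, each value 0 or 1.
rows = [(1, 1)] * 12 + [(1, 0)] * 6 + [(0, 1)] * 8 + [(0, 0)] * 14
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]


def mean(values):
    return sum(values) / len(values)


# The mean of a 0/1 variable is just the proportion of 1s.
mean_x1_when_x2_is_0 = mean([a for a, b in rows if b == 0])
mean_x1_when_x2_is_1 = mean([a for a, b in rows if b == 1])

# Pearson correlation computed directly from the two columns.
m1, m2 = mean(x1), mean(x2)
cov = sum((a - m1) * (b - m2) for a, b in rows)
ss1 = sum((a - m1) ** 2 for a in x1)
ss2 = sum((b - m2) ** 2 for b in x2)
phi = cov / sqrt(ss1 * ss2)

print(mean_x1_when_x2_is_0, mean_x1_when_x2_is_1, phi)
```

For a 2x2 table, the number of samples times the squared correlation equals the uncorrected chi-squared statistic, so the correlation and the chi-squared test are two views of the same comparison.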

posted by ROU_Xenophobe at 6:37 PM on November 6, 2011

Ah. You are NOT comparing two proportions in separate samples from the same underlying population; you are trying to get at whether two binary variables measured in the same 40 individuals are related beyond chance. Yes, you will still want to reframe your data in terms of a 2x2 table, and you will want to use a chi-squared test to evaluate it.

Using blue-eyed and friendly as our variables which take yes-no as their values, you can set up a 2x2 table of counts (blue-eyed y/n the rows; friendly y/n the columns; so the cells are friendly+blue, unfriendly+blue, friendly+notblue, unfriendly+notblue).

You are interested in whether the frequencies in each cell differ from what you would expect from drawing a very large number of samples of 40 people from that same universe if the two variables were completely unrelated. You will be comparing the expected cell count to the actual cell count.

The expected cell count is the product of the marginal row total and the marginal column total, divided by the total number of observations in the table. So to get the expected value for the friendly blue cell, you would divide the product of total friendly and total blue by 40.
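Here is a minimal sketch of that expected-count calculation in Python. The observed counts are invented for illustration (rows are blue-eyed yes/no, columns are friendly yes/no), not taken from anyone's real data.

```python
# Hypothetical observed counts for 40 people.
# Rows: blue-eyed yes, blue-eyed no. Columns: friendly yes, friendly no.
observed = [
    [12, 6],
    [8, 14],
]

row_totals = [sum(row) for row in observed]        # total blue, total not blue
col_totals = [sum(col) for col in zip(*observed)]  # total friendly, total unfriendly
n = sum(row_totals)                                # 40 observations

# Expected count for each cell = row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
```

With these made-up numbers there are 18 blue-eyed and 20 friendly people, so the expected friendly blue cell is 18 * 20 / 40 = 9.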

Chi-squared is the sum of ((the squared difference between observed count and expected count)/expected count) for the four cells of your table. The degrees of freedom are (number of rows-1)(number of columns-1)=(2-1)(2-1)=1.
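A short sketch of that sum, again on invented counts, using only the Python standard library. The p-value line relies on the fact that a chi-squared variable with 1 degree of freedom is a squared standard normal, so it applies only to the 2x2 case. No continuity correction is applied here, so software that applies one by default may report a somewhat different number.

```python
from math import erfc, sqrt

# Hypothetical observed counts (rows: blue-eyed y/n, columns: friendly y/n).
observed = [
    [12, 6],
    [8, 14],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over the four cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this shortcut is valid only when df == 1.
p_value = erfc(sqrt(chi2 / 2))

print(chi2, df, p_value)
```

For these made-up counts the statistic works out to 40/11, about 3.64, which falls just short of the usual 3.84 cutoff for significance at the 0.05 level with 1 degree of freedom.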

Does that help?

posted by gingerest at 11:21 PM on November 6, 2011

posted by k8t at 3:11 PM on November 6, 2011