April 17, 2010 11:58 AM

I have a wonderfully large dataset that I'm working with for a long-term project. I am analyzing a small section of the dataset for my master's thesis. In meeting with my thesis advisor last week, she suggested I run some statistical tests of significance on the 4 tables I'm working with. She knows that I am yet to be versed in quantitative analysis methods (I've done solely qualitative work thus far) and that I'm under a massive time crunch to get this done. She suggested I seek help from others, as she doesn't want me to get bogged down with figuring out this step, and would rather I concentrate on analyzing the other aspects of this data. To this end, I'm wondering if somebody might be able to suggest the best type of test of significance to run, the easiest way to run it, and a good, simple resource for what the resultant values mean?

I have 4 tables in an Excel spreadsheet: two of them show the counts for 7 variables across 5 age groups; two of them show the counts of 7 variables across 3 gender categories. I can visually see that there are meaningful differences between the distribution of the 7 variables across the age groups; much harder to tell with the gender categories. I need the mathy numbers to back up my arguments for why these differences are occurring and how they are trending. Can you help me here? Thanks!

By helping, you are not doing my homework for me. The real 'homework' is in the examination of the reasons for these trends in context, which is what I'm primarily working on here. Thanks!

PS. I should add that I am a sociolinguist and so this work is being done in the field of Humanities, i.e. iamnotamath(s)person.
posted by iamkimiam to Education (19 answers total) 6 users marked this as a favorite

Sounds like the basic problem is "observed vs. expected", i.e.: "do the numbers of cases in the cells differ significantly from the numbers that would be expected by a random distribution of cases into the same cells?"

From the general frame of your topic I am assuming there is no expectation that the underlying data would be normally distributed. The standby for that problem is "chi-square" so long as all 'expected cell' frequencies are greater than 5. However a 5X7 cross tabulation is pretty large, can you lump some of the categories?

This would be to help determine if your tables differed significantly from random. To compare the "observed" tables together without a comparison to the "expected" table, I'm not sure where I would go with that.

posted by Rumple at 12:23 PM on April 17, 2010
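For anyone who wants to see the "observed vs. expected" idea in numbers, here is a minimal sketch in plain Python. The counts are made up for illustration (3 gender categories by 7 variables) and the function names are my own, not from any package. It computes the expected count for each cell, the chi-square statistic, and the smallest expected count so the "greater than 5" rule of thumb can be checked.

```python
def expected_counts(table):
    """Expected cell counts if rows and columns were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Return (statistic, degrees of freedom, smallest expected count)."""
    expected = expected_counts(table)
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    smallest = min(min(row) for row in expected)
    return statistic, dof, smallest


# Made-up counts: 3 gender categories (rows) x 7 variables (columns).
observed = [
    [30, 22, 15, 12, 9, 7, 5],
    [25, 24, 18, 14, 8, 6, 5],
    [10, 9, 8, 8, 6, 5, 4],
]
statistic, dof, smallest = chi_square(observed)
print(f"chi-square = {statistic:.2f}, df = {dof}")
if smallest < 5:
    # Rule of thumb from the comment above: expected counts should exceed 5.
    print(f"smallest expected count is {smallest:.1f}; consider merging categories")
```

With these made-up counts the last column is thin enough that the smallest expected count falls below 5, which is the situation where lumping categories together is usually suggested.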

Do you know anyone in the Stats/Econ department? 'Cause it sounds like you want a simple regression model for the data, and maybe some confidence interval/p-value stuff (gosh, it's been so long for me I barely remember the words...). There's software that does it easily, and any econ/stats major, even an undergrad, would have it...

posted by ifjuly at 12:42 PM on April 17, 2010

What you want is rbrul, which was designed specifically for this sort of sociolinguistic modeling. It's an R package, but it's *way, way more user-friendly* than using R directly, and it basically walks you through the steps you need to follow.

(Your advisor might be used to Goldvarb, which *used* to be the standard program for doing quantitative sociolinguistics. But it turns out there are some pretty serious problems with Goldvarb that rbrul fixes.)

posted by nebulawindphone at 1:16 PM on April 17, 2010 [1 favorite]

Let me axe you something first.

So in your first table, you have age down the side, and you have completely different variables in each column. Or the other way around, doesn't matter.

Is that right?

Because if it is, you don't have four tables statistically, you have 7 5x1 tables all mushed together.

posted by ROU_Xenophobe at 1:23 PM on April 17, 2010

It depends on what the 7 variables are and what level of measurement characterizes them.

What are the 7 variables? If they are continuous and at "interval" level, you could run analysis of variance (ANOVA) for the age groups and then for the gender groups (since it has 3 levels, you couldn't run t-tests). An example of this would be height in inches, or the amount of times they say "um" when talking.

These types of tests can be run on pretty much any statistical analysis software package (e.g., SPSS, SAS, etc.)

MeMail or call if you want more consult/help.

posted by jasper411 at 1:45 PM on April 17, 2010

This is not such a straightforward question to answer without a closer look at your data and study design. What are your predictor and response variables? What sort of distribution do they follow? Are your predictor variables independent or is there a reason to worry about multicollinearity?

My best advice is for you to go talk to a statistical consultant. Contact the stats (or math) department at state, and see if they offer this service (most universities do). Typically you will email details about your problem to the designated consultant, and then meet with them. The first hour or two is free (and should be sufficient for your needs).

If state doesn't have such a service, see if Cal will let you stop in (link).

I want to reiterate that you really need to use caution when running tests that you are not entirely familiar with. You can put your data through pretty much any test (especially with point and click software), and get results. There is no built in protection against misuse.

posted by special-k at 2:12 PM on April 17, 2010 [2 favorites]

It's been a long time since I've done stats.

Chi-square should work for a count-by-gender analysis the way Rumple described. Keep in mind, though, that chi-square assumes that the differences between groups for which you've got counts are qualitative (male vs. female, or generally, group A vs. B vs. C). It will work for your age data if you're looking for cohort differences in your age groups (e.g. are 20-30 yr. olds more or less likely to use X - whatever you're counting - than other groups?) but not progressive age differences (are people more likely to use X as they get older?).

What analysis you use will depend on what questions you're asking of your age data. And you've got two tables by age and two by gender. Assuming the seven variables are the same, is there a third categorical variable (like geographical area) you're separating the age and gender variables by, or two different data sets? You might want an ANOVA.

I don't think I can answer your question, but answering these questions will probably get you better answers.

posted by nangar at 2:20 PM on April 17, 2010

If it's counts, I agree with others recommending chi-square as a good place to start. You can run it once for each variable, across age then gender. So break the tables into 7 smaller ones, one for each variable. 5x7 is way too big to interpret easily. Seven 5x1 tables will be much easier.

Then your interpretation would be, are there statistically significant differences for the counts in variable 1 (in the first table) for the different levels of age, then separately interpret variable 2, and so on. Then, the same thing again but with gender. You'd be looking at 14 different tables this way, and a result for each variable. You won't get the size, or the direction of the effect really, but you can do that by eye. There's also the gamma co-efficient, but interpretation of that is hazy so I'd stick to the size of the numbers and whether chi-square is significant. Of course what you need most of all is time to take in these new concepts and become comfortable with interpreting them.

If you want to quickly see how to do this kind of thing in SPSS with a walkthrough of interpretation, and don't mind shelling out for a book, the Andy Field book covers all the basics and isn't too pricey as stats books go.

Oh, you could also produce a few bar charts/histograms to help visualise what the possible patterns might be. Again in SPSS* it's easy to run off basic descriptive stats and charts. If you're not comfortable with quantitative stuff, keep it simple, you'll be surprised by how far you can go with just a few tools like this.

* Not recommending SPSS because it's particularly good, by the way, but it's very commonly available and has a shallow learning curve for the basics.

posted by danteGideon at 2:37 PM on April 17, 2010
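A rough sketch of the "one test per variable" idea, in plain Python with made-up counts. One way to read the suggestion is to set each variable against all the others combined, which gives a 5x2 table per variable and keeps the different group sizes in view; that reading, the function names, and the numbers are all my assumptions. The p-value shortcut below is a closed form that only holds for an even number of degrees of freedom, which happens to fit a 5x2 table (df = 4).

```python
import math


def chi_square_stat(table):
    """Pearson chi-square statistic for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(table))
        for j in range(len(table[0]))
    )


def p_value_even_dof(statistic, dof):
    """Upper-tail chi-square probability; closed form valid only for even dof."""
    if dof % 2 != 0:
        raise ValueError("this shortcut only handles even degrees of freedom")
    half = statistic / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))


# Made-up counts: rows are the 5 age groups, columns are the 7 variables.
counts = [
    [60, 45, 40, 35, 30, 25, 20],
    [70, 45, 35, 28, 22, 16, 12],
    [50, 28, 20, 15, 12, 10, 8],
    [45, 22, 15, 12, 9, 8, 7],
    [40, 18, 12, 10, 8, 7, 6],
]

for v in range(len(counts[0])):
    # 5x2 table: this variable versus all the others, per age group.
    table = [[row[v], sum(row) - row[v]] for row in counts]
    stat = chi_square_stat(table)
    dof = (len(table) - 1) * (len(table[0]) - 1)  # (5 - 1) * (2 - 1) = 4
    p = p_value_even_dof(stat, dof)
    print(f"variable {v + 1}: chi-square = {stat:.2f}, df = {dof}, p = {p:.4f}")
```

Because this runs seven tests, a stricter cut-off than the usual 0.05 (for example 0.05 divided by 7) is a common precaution against chance findings.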

Like Special-K said, without knowing your design and hypotheses, it is next to impossible to help you choose the most appropriate analysis technique.

PS, this is what your advisor is SUPPOSED to do with you. Seriously.

posted by k8t at 8:44 PM on April 17, 2010

You should definitely use a statistical consultant. If they're not available, then you should pay a stats grad student for a few hours of help. It's that important.

posted by acidic at 10:20 PM on April 17, 2010

What acidic and k8t said -- this is not a problem you can solve on the internet.

But also: your department is doing you a serious disservice if they don't offer you the resources to learn the methodological skills (pretty basic ones from the sounds of things) to do your thesis. Seriously.

posted by paultopia at 12:07 AM on April 18, 2010

Thanks for all the advice so far! You've given me a lot to think about. Also, I am very aware of the limitations of my school/department...I am on the last two-tenths of what has been a very challenging run, to say the least. I'm trying to finish, so that I can begin a new race, where I will be devoting my time to full analysis of the data and comprehensive exploration of quantitative methods in sociolinguistics. Until then, I have the next couple of days to figure this out. I also have a better explanation of exactly what I'm trying to do, if that can help anybody help me further at this point...

Keeping things as simple as possible: There is one item*. This item comes in seven 'flavors'. People were asked to pick one and only one flavor that was their favorite. The order of the most preferred flavors is fairly predictable, based on other factors entirely. However, the amount by which a specific flavor is preferred changes with age group. For each flavor, I need to know if the difference in percentage between say, the 18-29 age group and the 30-39 is meaningful (statistically significant). There is a definite trend (the younger age groups are way more diverse in their flavor preferences), and there are reasons for that trend...I have already written this up, and I can explain and justify this based on my background knowledge of 'flavors'. I can see the patterns, I have done several charts illustrating what's going on (for some flavors, the difference in preference between age groups is 15% or more; others it's only 1-2%...don't know if that's important or not, and especially considering the distribution of ages of the respondents - the majority of respondents were in the 18-29 and 30-39 age groups).

Anyways, I really like the seven 5x1 approach. I want to look into the chi-square and the rbrul suggestions upthread...given the extra info I've provided, do you think these are still the best routes to take?

Thanks again!

*By 'item', I mean 'word'; by 'flavor', I mean 'pronunciation'. I just wanted to keep my explanation simple here.

posted by iamkimiam at 5:29 PM on April 18, 2010
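On the question of whether a 15-point gap or a 1-2 point gap between two age groups is meaningful: that depends on how many respondents sit behind each percentage, not on the gap alone. Here is a minimal sketch in plain Python, with made-up group sizes and a function name of my own, of a 2x2 chi-square test for one flavor in two age groups. With one degree of freedom the p-value has a simple closed form using the complementary error function.

```python
import math


def compare_two_groups(hits_a, total_a, hits_b, total_b):
    """2x2 chi-square test: does the share picking one flavor differ between two groups?"""
    table = [[hits_a, total_a - hits_a], [hits_b, total_b - hits_b]]
    row_totals = [total_a, total_b]
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    n = total_a + total_b
    statistic = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(2)
        for j in range(2)
    )
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up numbers: 45% of 200 respondents aged 18-29 versus 30% of 180 aged 30-39.
stat_big, p_big = compare_two_groups(90, 200, 54, 180)
# Made-up numbers: 10% versus roughly 8% in the same two groups.
stat_small, p_small = compare_two_groups(20, 200, 14, 180)

print(f"15-point gap: chi-square = {stat_big:.2f}, p = {p_big:.4f}")
print(f"2-point gap:  chi-square = {stat_small:.2f}, p = {p_small:.4f}")
```

The same percentage gap can come out significant with large groups and not significant with small ones, so the older age groups, which have fewer respondents, will need bigger gaps to reach significance.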

Man, seriously, your department/advisor/whatever is high on the Sucking List right now.

posted by k8t at 9:40 PM on April 18, 2010

I think you can use chi-squares here. However, I'm not a statistician, just a former psych major who took some classes in this stuff a long time ago.

jasper411 offered to help via email. You could take them up on it. You could also try contacting nebulawindphone - I'd think this would be OK. If he's busy, he can say so. To the best of my recollection he's also a linguistics grad student.

Ideally, you'd be talking to a friendly stats professor. Can you talk to your advisor again and see if she can recommend somebody in your department, or the stats department, to talk to? If not, can you talk to another professor and see if they can recommend somebody to talk to?

posted by nangar at 6:17 AM on April 19, 2010

Thanks, nangar. I've put some feelers out there. I'm certainly accepting any and all help I can get. Long story, but this situation is unexpected and frankly, downright embarrassing. And the time-frame is insanity. But I've got to put my ego and frustration aside and just not focus on my feelings about it...I've got work to do. I've been looking into the chi-square and that seems to be a good way to go about this. If anybody has some additional advice about using this method, a hand 'n brain to lend, or just some words of encouragement, I'd sure appreciate it!

posted by iamkimiam at 11:07 AM on April 19, 2010

Rumple: *The standby for that problem is "chi-square" so long as all 'expected cell' frequencies are greater than 5.*

From some cursory reading, ANOVA doesn't look like a good match for the data - it presumes normal distributions. Chi-square should work here for testing independence between variables. And there's a JavaScript page that runs the chi-square calculations. However, Rumple's quote worries me.

posted by Pronoiac at 4:19 PM on April 19, 2010

Definitely, if you want to see whether the differences in frequencies between your groups are statistically significant, chi-squared will tell you (as much as anything can). It's a good exploratory tool for this type of categorical data. Borrow a book on SPSS from your library, spend an hour in front of SPSS in a lab, and you'll have all the results. If you get your Excel spreadsheet organised into one sheet for each set of data, you can import it straight in. Then take the results away and interpret according to the book.

If you want a great explanation of how and why chi-squared works, try and get hold of my favourite introductory stats book by Agresti and Finlay (doesn't have to be the 4th edition). Remember, you are not stupid. This is hard. But do-able!

posted by danteGideon at 4:26 PM on April 19, 2010 [1 favorite]

You MaFités really know how to come through! Thanks to everybody's help, suggestions, links and encouragement, I have successfully conquered the chi-square (well, enough for one day/week/masters-thesis). And hey, I even learned how to pronounce it correctly...oh the meta irony!

Seriously, you people are tops.

posted by iamkimiam at 11:00 PM on April 19, 2010

This thread is closed to new comments.

posted by stinker at 12:15 PM on April 17, 2010