Stats-filter: Comparing individual means?
December 6, 2012 8:26 AM

StatsFilter: How do I compare the means of an individual at three different points?

Ok, so I'm working on a project just now and I want to find out if the means of my dependent variable X for ONE person at times A, B and C are significantly different from one another. I'm not especially interested in finding out if this speaker is significantly different from other speakers, since that's part of the regression analysis that I'm also doing.

To give you an idea of what I'm looking at, let's say that we interview a speaker and find out that they say the word 'bag' in a particular way and we take measurements of that. In year 1, I take 200 observations of this word, in year 2, I take 240 observations, and in year 3, I take 175 observations. Each observation has its own line in my excel sheet, who says it, what kind of word it is (noun, verb etc), what came before it, and so on. Now what I want to know is if the way that the person says 'bag' is significantly different in Year 1 compared to Year 2 compared to year 3.

The data is not normal, so ANOVA seems to be out, plus the data are unbalanced (so I have different numbers of observations at times A, B and C). I did think that Kruskal-Wallis would work, but that seems to only apply to group means while I'm just interested in comparing the value of X for one person at three different times.

In terms of stats packages, I have SPSS and R, although if the solution is with R, I'll need step-by-step instructions for it cause I'm a noob with it.

Any stats-aces out there able to help a brother out?
posted by Scottie_Bob to Science & Nature (13 answers total)

You said you are interested in the value of ONE person at three different times. Does this mean you're only interested in comparing the data from one particular individual (and that you don't care at all about the other data)?

Or does it mean that you want to employ a repeated measures (within subjects) analysis?

If it's the former, and your data is not normal for that particular subject, a Kruskal-Wallis test sounds like the way to go.

If it's the latter, a Friedman test is the way to go.

(this is based on this info).
posted by spacediver at 9:35 AM on December 6, 2012

Yeah, ONE person at THREE different times, with the dependent variable of 'how do they pronounce BAG'. I do care about the other data, but since it's kind of all over the place, I want to split it up so I can do the following:

Speaker A at Year 1, Year 2 and Year 3
Speaker B at Year 1, Year 2 and Year 3
Speaker C at Year 1 and Year 3 (no data for year 2)

I don't really want to compare Speaker A, Speaker B and Speaker C, since I'm also doing a mixed-effects linear regression model, which will show me which variables are most important in modelling the variation.

I think.
posted by Scottie_Bob at 10:12 AM on December 6, 2012

It's unclear how you set up your design.

When you say you had 200 observations in year 1, you're talking about 200 observations per speaker? Or are you talking about 200 speakers?

And was the way in which they pronounced "bag" just one of these observations?
posted by spacediver at 10:19 AM on December 6, 2012

Yeah, the first one; 200 observations per speaker per year (well, it's unbalanced, so it's not exactly the same amount per speaker per year), and yes, the way in which they pronounced "bag" was one of those observations. If the speaker had said something like "I grabbed the bag and ran away from the man" then that'd be four tokens of "a" (grabbed, bag, ran, man).
posted by Scottie_Bob at 10:25 AM on December 6, 2012

Ok so you have one measurement for this speaker at three different times. I fail to see how you can do any sort of statistical analysis if you only have three numbers to compare. And how can the data be non-normal if the three conditions only comprise one data point?

I'm envisioning your data like this:

Speaker A (whom you are interested in):

Bag pronunciation style

time 1: 5
time 2: 6
time 3: 9

(i'm using arbitrary units for the dependent measure).

Can you clarify some more?
posted by spacediver at 11:40 AM on December 6, 2012

Ah sorry, I don't think I'm explaining it very well! Sorry about that... :(

Ok, so I have a bunch of measurements for three speakers at three different times.

Speaker A:

Bag pronunciation style (we measure them in hertz and then transform into another value, but for the moment, let's just go with arbitrary numbers!)

Time 1: 5.67, 4.67, 8.79, 4.23, 5.39, 4.56 (and so on, for, say, 200 tokens)
Time 2: 8.98, 7.86, 5.99, 4.05, 3.49, 8.90 (and so on, for, say, 230 tokens)
Time 3: 5.46, 7.68, 3.56, 7.89, 4.45, 6.79 (and so on, for, say, 180 tokens)

We can take the above and say that I've done the same for Speaker B and Speaker C (albeit with different values).

So, I have a bunch of measurements for this one variable and I'm wondering whether, for any one speaker, the measurements of BAG are significantly different at Time 1, Time 2, and Time 3 (so comparing 1~2, 1~3, 2~3).

It *looks* like the KW works, since my data is non-normal and so on, but it'd be good to get clarification from someone!
posted by Scottie_Bob at 12:02 PM on December 6, 2012

Before I attempt to answer this, I need clarification on what you mean by "for any one speaker".

You have three speakers, but you're only interested in speaker A?

Or are you interested to see whether, as a group, they show any patterned difference between time 1, 2, and 3?

Note that this is very different from looking at differences BETWEEN speakers.

I really think you should look up the difference between independent measures and repeated measures design. It sounds like you clearly need a repeated measures analysis here but I don't think you understand what it means.

You should also be asking yourself whether you want to see whether these differences are generalizable to a more general population, or whether you're ONLY interested in whether there are differences for a PARTICULAR individual between time 1, 2 and 3.
posted by spacediver at 12:14 PM on December 6, 2012

Ok, I'll try again, with more details in case it helps... :(

I did a study looking at how adolescents spoke, and I interviewed about 20 or so of them over three years of fieldwork. These 20 speakers formed four different groups (Group A, Group B, Group C, Group D), at Year 1, Year 2 and Year 3, and I was interested in how the social group they belonged to influenced how they sounded. Each year I interviewed these speakers, I collected a bunch of data. In each year, I measured how they said words like 'bag' (see above), which meant that there was a bunch of measurements to be taken. There was no experimental design, so they didn't have to do a test or take a tablet or anything like that. I just interviewed them, placed them in groups, and analysed their data. For that reason, I don't think something like rANOVA works, especially since that still needs the data to be normally distributed...

1) In this chapter I'm working on just now, I'm only looking at a subset of speakers, one each from Group A, Group B and Group C.
2) I'm looking at each of their 'bag' data at Time 1, Time 2 and Time 3.

So what I want to know, for each of Speaker A, Speaker B and Speaker C, is whether how they say 'bag' (not just the one measurement, but all of the measurements I took) in Year 1 is significantly different from how they say it in Year 2 and Year 3 (and so on).

I don't want to lump them all into one group because they're not. I'm also not interested in more generalisable differences, but only in whether there are differences for a particular individual at time 1 compared to time 2 compared to time 3.

I can imagine that you're getting frustrated with me here (sorry!), but please stick with me! :)
posted by Scottie_Bob at 12:45 PM on December 6, 2012

So do you want a separate statistical analysis, each for speaker A, and B, and C?
posted by spacediver at 12:49 PM on December 6, 2012

Yes, yes I do :)

I'll then run a mixed-effects regression analysis on the whole dataset to see which independent variables affect the dependent variable the most.
posted by Scottie_Bob at 12:51 PM on December 6, 2012


Here is how I would think about this.

Let's assume you're only interested in one speaker (you can apply this thinking and methodology for each speaker separately). I'll also explain the concept imagining that you were only dealing with TWO sets of data (rather than the three you're interested in), and then build up to three.

Jane has two sets of data, each with a few hundred observations. It is critical here to understand that each set of data is to be considered a sample from all the possible measurements you could have taken from Jane at that particular time. So at time 1, you made 200 measurements (out of a potential infinite number of measurements), and this is your sample. You are using this sample to make estimates about the "infinite population" that theoretically exists. Similarly for time 2.

You are interested in whether there is a statistically significant difference between these two sets. Specifically, you are estimating the probability of obtaining such a difference between these two data sets, assuming there was in fact no difference in the underlying "infinite populations" at each time point. In other words, suppose that at time 1 and time 2, Jane's speech is actually identical. Then what is the probability of obtaining a difference between the two sample means at least as large as the one you DID in fact obtain when you measured her 200 times the first time and 230 times the second time? If that probability is sufficiently low (conventionally below 0.05), you can reject the null hypothesis of no difference.
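To make that logic concrete, here is a rough sketch in Python (not one of the packages mentioned in this thread, so treat it as an illustration only). It uses a permutation test, which is a different technique from the tests named in this thread: it shuffles the time labels many times and counts how often a mean difference at least as large as the observed one turns up by chance. The function name, the number of shuffles and the seed are all made up for the example, and the values are just the first six arbitrary numbers posted above for Time 1 and Time 2.

```python
import random
from statistics import mean

def permutation_p_value(sample_1, sample_2, n_shuffles=5000, seed=0):
    """Estimate how often a mean difference at least as large as the observed
    one appears when the time labels are assigned at random (the null)."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = abs(mean(sample_1) - mean(sample_2))
    pooled = list(sample_1) + list(sample_2)
    n1 = len(sample_1)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)            # pretend the time labels mean nothing
        diff = abs(mean(pooled[:n1]) - mean(pooled[n1:]))
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Arbitrary illustrative values (first six listed for each time above).
time_1 = [5.67, 4.67, 8.79, 4.23, 5.39, 4.56]
time_2 = [8.98, 7.86, 5.99, 4.05, 3.49, 8.90]

p = permutation_p_value(time_1, time_2)
print(f"estimated p = {p:.3f}")
```

A small p here means the observed gap between the two sample means would rarely arise if Jane's speech were really the same at both times.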

If the data are normal, etc. you would perform the above analysis using an independent measures t-test. If they weren't normal, you'd use a Mann-Whitney U test.
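For the non-normal, two-sample case, a Mann-Whitney U test can be sketched roughly as below. This is a hand-rolled Python version using only the standard library, with the large-sample normal approximation (tie-corrected, no continuity correction), so it is an assumption-laden illustration rather than what SPSS or R would report digit for digit. The helper names are invented for the example and the numbers are again the arbitrary Time 1 and Time 2 values from above.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(sample_1, sample_2):
    """Return (U for sample_1, two-sided p) via the normal approximation."""
    n1, n2 = len(sample_1), len(sample_2)
    n = n1 + n2
    pooled = list(sample_1) + list(sample_2)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: each run of t tied values shrinks the variance.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:                  # every value identical: no evidence
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return u1, p

# Arbitrary illustrative values (first six listed for each time above).
time_1 = [5.67, 4.67, 8.79, 4.23, 5.39, 4.56]
time_2 = [8.98, 7.86, 5.99, 4.05, 3.49, 8.90]

u, p = mann_whitney_u(time_1, time_2)
print(f"U = {u}, p = {p:.3f}")
```

With only six values per time the normal approximation is crude; with a couple of hundred tokens per time it should be much more reasonable.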

Since you have three sets of data, you use a Kruskal-Wallis test. I've never done one in SPSS, but there is a YouTube tutorial here.

You should also check to see that the assumptions of the test are met as indicated here (specifically you need to ensure that the three distributions have the same shape).

Unequal sample sizes are not a problem - the test works on ranks, and its statistic scales each group's contribution by that group's sample size, similar to how a t test pools the variances when the sample sizes are unequal.
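For anyone curious what the test computes, here is a minimal sketch of the Kruskal-Wallis H statistic for exactly three groups, written in plain Python as an illustration (the function names are made up, and this is not the SPSS or R procedure itself). All values are ranked together, each group's squared rank sum is divided by that group's own size (which is how unequal sample sizes are handled), and the p-value comes from the chi-square approximation with 2 degrees of freedom, which has the simple closed form exp(-H/2). The data are the arbitrary values posted above for Speaker A.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_3(group_a, group_b, group_c):
    """Return (H, p) for three groups, using the chi-square approximation."""
    groups = [list(group_a), list(group_b), list(group_c)]
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)    # scaled by this group's own size
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Standard tie correction; it equals 1 when no values are tied.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_term / (n ** 3 - n)
    if correction > 0:
        h /= correction
    h = max(h, 0.0)                        # guard against float rounding
    p = math.exp(-h / 2)                   # chi-square tail for 2 d.f. only
    return h, p

# Arbitrary illustrative values listed above for Speaker A.
time_1 = [5.67, 4.67, 8.79, 4.23, 5.39, 4.56]
time_2 = [8.98, 7.86, 5.99, 4.05, 3.49, 8.90]
time_3 = [5.46, 7.68, 3.56, 7.89, 4.45, 6.79]

h, p = kruskal_wallis_3(time_1, time_2, time_3)
print(f"H = {h:.3f}, p = {p:.3f}")
```

The exp(-H/2) shortcut only holds for three groups; with a different number of groups the chi-square tail needs a proper statistics library. A significant result only says that at least one time differs, not which pair.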

It is crucial that you understand that you are using multiple observations of a SINGLE subject to generalize to the population of observations of THAT subject. You cannot make statistical inferences about other individuals (i.e. this is more of a series of case studies).
posted by spacediver at 1:22 PM on December 6, 2012

Righto, I think I've got it. The data are the same shape, and I did a few exploratory runs to make sure about the homogeneity of variances (following that youtube tutorial), and it suggested that the KW was ok to use.

I'm definitely *not* using multiple observations of a SINGLE subject to make statistical inferences about other individuals. For the other speakers, I'll run the KW test separately.

Phew, this stats stuff is hard!

Thanks so much for your help today; I've been watching this thread like a hawk!
posted by Scottie_Bob at 1:42 PM on December 6, 2012

aye, no prob, glad to help, and hope it all works out :)
posted by spacediver at 3:03 PM on December 6, 2012
