# Statistical conundrum

March 13, 2021 4:56 AM Subscribe

I have a set of anonymous survey data that was sent out via email and social media at two time points. Some of the people who filled it out at time point 1 probably did so again at time point 2, but because the survey was anonymous, I don't know the proportion of overlap. What is the best way to compare time point 1 and time point 2?

There were hundreds of responses each time; > 400 people for time 1 and > 1800 for time 2. The overlap could range from 0 to 400+ (i.e. everyone at time 1 filled it out at time point 2, plus another ~1400 or so at time 2). Unfortunately I have no way of determining how paired/unpaired the data is.

My original thought was that since the sample size is so large, it would be OK to just do an unpaired t-test and note the potential for correlation in the limitations section of the paper. However, one reviewer has suggested something called an "optimal pooled t-test," which as best I can tell involves weighting the paired and unpaired data slightly differently. Makes sense, but is that even doable without knowing how much overlap is in your sample? From this analysis of partially paired data, it looks like the unpaired t-test and pooled t-test are pretty similar -- pooled has more power for highly paired data, but if anything that implies that if I detect a difference between Time 1 and Time 2 with an unpaired test, the true difference will be even greater. Right?
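For the record, the unpaired comparison in question is just Welch's two-sample t-test. A minimal stdlib-only Python sketch (the toy numbers are made up for illustration):

```python
import math
import statistics


def welch_t(a, b):
    """Welch's unpaired two-sample t statistic.

    Treats the two samples as independent; any positive correlation
    from respondents who answered at both time points is ignored,
    which (if present) tends to make the test conservative.
    """
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    se = math.sqrt(v1 / n1 + v2 / n2)
    return (m1 - m2) / se


# Toy data standing in for the time 1 / time 2 responses.
time1 = [1, 2, 3, 4, 5]
time2 = [2, 3, 4, 5, 6]
print(welch_t(time1, time2))  # -1.0: the means differ by 1 and the SE is 1
```

In practice you would get the p-value from `scipy.stats.ttest_ind(a, b, equal_var=False)`, which also handles the Welch–Satterthwaite degrees of freedom.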

In retrospect, I wish I had included a statistician in the study design before we even got started -- lessons learned! I have emailed for a local stats consult but it will be a few days before they get back to me, and I'm hoping for a little hivemind guidance so that I can make the most of my time with the stats person.

From what you mention, you can't tell which data points pair with which, if any.

Unfortunately, without this information you can't use a paired t-test or the linked partially-paired test either. I don't see any way to guess who pairs with whom, or even to estimate the proportion of repeats.

posted by Maxwell_Smart at 7:44 AM on March 13 [1 favorite]

Response by poster: a robot made out of meat: I'm not looking for change within/between subjects. The data don't really come from a longitudinal cohort; it's repeated cross-sectional, looking for change at the aggregate level.

Maxwell_Smart: that's right. Even metadata like IP address isn't usable to link records, because people might be taking the survey from different devices/locations at times 1 and 2. Sounds like with this "potentially partially paired" dataset, the only option, then, is to use an unpaired test and acknowledge the limitation.

posted by basalganglia at 10:01 AM on March 13

I'm a statistician. There's a "two statisticians, three opinions" sort of aesthetic in this field sometimes, but honestly my first-pass solution (with some minimal assumptions on the data distribution) would have been the one you proposed initially. I would not expect your choice (unpaired vs paired t-test) to affect the estimated difference, just the standard error. You're right that not knowing the pairing reduces power and so should make a test of the difference more conservative.
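eirias's point about the standard error can be checked numerically: with positively correlated pairs, the paired SE of the mean difference is smaller than the unpaired SE, so ignoring the pairing is conservative. A stdlib-only sketch with simulated data (the correlation strength, shift, and sample size are invented for illustration):

```python
import math
import random
import statistics

rng = random.Random(42)

# Simulate n respondents who answered at both time points: time-2 scores
# track time-1 scores (strong positive correlation) plus a true shift of +0.5.
n = 400
time1 = [rng.gauss(3.0, 1.0) for _ in range(n)]
time2 = [x + 0.5 + rng.gauss(0.0, 0.5) for x in time1]

# Unpaired SE ignores the correlation between the two columns.
se_unpaired = math.sqrt(
    statistics.variance(time1) / n + statistics.variance(time2) / n
)

# Paired SE uses the per-person differences, whose variance shrinks
# when the correlation is positive.
diffs = [b - a for a, b in zip(time1, time2)]
se_paired = statistics.stdev(diffs) / math.sqrt(n)

print(f"unpaired SE: {se_unpaired:.3f}, paired SE: {se_paired:.3f}")
assert se_paired < se_unpaired  # pairing only helps; ignoring it loses power
```

Note that the estimated mean difference is essentially the same either way; only the SE (and hence the p-value) changes, which is why the unpaired analysis understates rather than overstates significance when the overlap is positively correlated.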

That being said, if I were your consulting statistician, probably my first advice to you would be to make a graph or two :). And I'd want to hear first what it is you most wanted to learn from your data, to make sure that we weren't missing the forest for the trees. Sometimes a scientist comes to me with a question like this one that boils down to a detail, but it turns out there's another statistical approach that fits the substantive question better.

posted by eirias at 11:36 AM on March 13 [2 favorites]

posted by a robot made out of meat at 7:26 AM on March 13 [1 favorite]