How do I account for twins in statistical analysis?
September 18, 2007 7:48 PM

How do I account for twins in statistical analysis? By 'twins' I mean two people born at the same time, rather than any unknown-to-me technical meaning for the word 'twins' in statistics.

I have a cohort of children split into two groups: those who were given a certain medication, and those who weren't. We want to look at differences between the two groups. This would normally be all fine and dandy; however, the cohort includes twins, and the two children in a twin pair are not 100% independent, but of course nor are they 100% the same. When the children were randomised to the two groups they were all treated as independent, so there would be some twins who are both in one group only, and some twins with one sibling in each group.

How do I account for this in my analysis? I'm using SPSS 15 so information specific to that would be great, but any general information about what tests to use and relevant topics would also be really helpful. I've tried a google search, but I seem to just get a lot of journal articles about twins and statistics tutorials with lines such as, "of course we would also account for twins, but that is too complex a topic to cover here".
posted by teem to Science & Nature (14 answers total) 1 user marked this as a favorite
I assume you mean "identical twins".

My gut feeling is to ignore them, because they have been randomly assigned. If it was the case that twins were deliberately always kept in the same group, or were always kept in different groups, then that would be more of a concern. If you look at the facts that:

(a) siblings were split up, or not, randomly
(b) you don't seem to have any biological evidence that being an identical twin would be of relevance (after all, while being genetically identical, there can still be differences between twins due to diet or other factors)

It seems to me that any effort to complicate the analysis by taking into account twins would probably reduce your statistical power by introducing additional parameters into the model. This might not be the answer you want, but as an ecologist, I'm used to facing up to the fact that my data is full of noise and nastiness.
posted by Jimbob at 7:58 PM on September 18, 2007

Some of the twins were identical, some were not (each set of twins was raised together). Our outcomes were behaviour and allergy, so there is a lot of potential overlap between biology and environment.
posted by teem at 8:04 PM on September 18, 2007

Are twins the only sets of children from the same family, or are there also nontwin siblings in there? Because nontwin siblings will also not be fully independent.

Assuming there are other nontwin children, and bearing in mind that I come from a regression-oriented world rather than an experiment-oriented one and that I have no special knowledge about dealing with twins:

So simple it avoids the problem: Ignore it until a reviewer forces you to address the issue, then do whatever the reviewer tells you to.

Simplest: use standard errors corrected for clustering on family. In stata, this is as simple as appending a ", cluster(variable)" to the end of most estimation commands; I have no idea how you'd do this in spss.
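For anyone without Stata to hand, the idea behind clustering on family can be sketched in plain Python. This is a minimal illustration, not what Stata's cluster() option does internally: it compares two group means and sums each child's contribution within a family before squaring, so twins count as one unit of information. The function name, the tuple layout and the toy data are all made up for the example, and no finite-sample correction is applied (packaged routines usually apply one, so their numbers will differ somewhat).

```python
from collections import defaultdict
from math import sqrt

def cluster_robust_diff(rows):
    """rows: list of (family_id, treated, outcome) tuples, treated is 0 or 1.
    Returns (difference in means, unclustered robust SE, family-clustered SE).
    Hypothetical helper for illustration; no finite-sample correction."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    n1, n0 = len(treated), len(control)
    mean1, mean0 = sum(treated) / n1, sum(control) / n0

    family_score = defaultdict(float)
    naive_var = 0.0
    for fam, t, y in rows:
        # Each child's residual, scaled by its influence on the difference in means
        w = (y - mean1) / n1 if t == 1 else -(y - mean0) / n0
        family_score[fam] += w   # add up within the family first
        naive_var += w * w       # treats every child as independent
    cluster_var = sum(s * s for s in family_score.values())
    return mean1 - mean0, sqrt(naive_var), sqrt(cluster_var)

# Made-up data: four families of twins whose outcomes move together
rows = [
    ("A", 1, 6.0), ("A", 1, 6.0),
    ("B", 1, 4.0), ("B", 1, 4.0),
    ("C", 0, 3.0), ("C", 0, 3.0),
    ("D", 0, 1.0), ("D", 0, 1.0),
]
diff, naive_se, cluster_se = cluster_robust_diff(rows)
print(diff, naive_se, cluster_se)
```

In this toy data the twins are perfectly alike, so the clustered standard error comes out larger than the unclustered one. When every family has a single child the two are the same.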

Less simple: Use multilevel techniques to model this. Again you'd have children grouped into families. But you could model the family groups either by allowing each family to have its own intercept, which you can predict (a random intercept model), or by allowing each family to have a different coefficient on some variable(s), which you can also predict (a random coefficient model). I have no idea how you'd do this in spss. In stata, you could do this with the xtmixed or gllamm commands, or you could use the nlme package for R, or boutique multilevel software.
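Written out, the two multilevel options above look roughly like this. The notation is my own shorthand rather than anything from a particular package: y_{ij} is the outcome for child i in family j, T_{ij} is a treatment indicator, and x_{ij} stands for whatever variable is allowed a family-specific coefficient.

```latex
% Random intercept model: each family j gets its own intercept shift u_j
y_{ij} = \beta_0 + \beta_1 T_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)

% Share of the unexplained variance that siblings have in common
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma^2}

% Random coefficient model: the slope on x_{ij} also varies by family
y_{ij} = \beta_0 + u_{0j} + (\beta_1 + u_{1j})\, x_{ij} + \varepsilon_{ij}
```

If \sigma_u^2 is estimated to be near zero, siblings are behaving like independent observations and the simpler analysis loses little.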

Even less simple: A three level model with children grouped into genomes grouped into families. Lots of 1-element genome groups, and a few with 2 or more elements in those groups.
posted by ROU_Xenophobe at 9:25 PM on September 18, 2007

Here we go:

which leads to pairwise concordance and probandwise concordance, and googling those terms seems to lead to a whole bunch of statistical stuff, in which I estimate a high probability of you finding a useful answer. :)
posted by aeschenkarnos at 9:36 PM on September 18, 2007

This is one of the rare times when I disagree with ROU_Xenophobe. When you do a medication trial, you should do it according to an intention-to-treat design. That means you randomize all comers to treatment groups A and B and then you look at the treatment outcome for group A and the treatment outcome for group B and see if they're statistically different.

Group B couldn't tolerate their medicine and all stopped taking it? Don't care. Some of the people in Group B live in the same household? Don't care. Group B contains more twins than group A? Don't care. The medicine in Group B was a space alien sex pheromone attractant, causing all members of group B to be abducted by aliens and subjected to rectal probe? Don't care.

All you care about are the differences (if any) in treatment outcome between groups A and B. If causing people to be abducted by aliens makes their allergies better, that's interesting - but this kind of study isn't the way to investigate the biological mechanism of that result.

The nice thing about intention-to-treat is that it exactly mimics what happens in the doctor's (or treating professional's) office. I see a patient, I diagnose them, I develop an intention to treat them with treatment A, and then I send them away with a prescription for treatment A. What happens after that - the treatment outcome - is the only thing I'm interested in learning about treatment A; namely, is it worth giving, or not.

The criticism that could be leveled by a reviewer is that the results of your study are confounded by the presence of too many twins. Since the twins were all randomised and the design is intention-to-treat, you can totally blow that reviewer out of the water.

You missed a chance to embed a twin study in your major study, by only randomizing one twin and then automatically enrolling the other twin in the other group - but that's over and done with. (Identical twin studies are useful because they allow you to test the treatment vs. the control in an identical genetic substrate, and often in a very similar environmental substrate too.) But since you didn't do it there's no point in talking or thinking about it now.
posted by ikkyu2 at 9:47 PM on September 18, 2007 [5 favorites]

Also, by twins you do not mean two people born at the same time. You mean a pair of cogestate siblings.
posted by ikkyu2 at 9:48 PM on September 18, 2007 [1 favorite]

If you're going to generalize the method to future twin studies, be careful how you define your terms: account for triplets (or more), and account for adoption. Shared home address or next of kin notification is likely to identify siblings, shared date of birth is likely to identify twins etc, but adoptive/natural parental status would almost certainly have to be asked for.

Whoops, another problem: milkman factor. The more important genetics are to the study the more essential it will be that you do actual assay rather than survey. There is a higher than insignificant chance of siblings not sharing the same father, even if they share the same mother. Genetic studies that may bring this issue up need to have it considered in their ethical review.
posted by aeschenkarnos at 9:48 PM on September 18, 2007

I like ikkyu2's answer a lot.
posted by aeschenkarnos at 9:50 PM on September 18, 2007

Thanks ikkyu2, that's a really good answer. And yes, I do mean cogestate siblings.

I made a mistake in describing how the children were randomised (I'm asking for a friend, and I made a wrong assumption). Twins were kept in the same group as their other twin. This was because the study design (the intervention was a medication given to breastfeeding mothers, and outcomes are being measured in the kids) made it impossible to do otherwise.

ROU_Xenophobe, I think they'll end up using Stata for that very function.
posted by teem at 10:48 PM on September 18, 2007

Interesting. There are a couple ways to go at this. You could skirt the issue by saying that intent-to-treat was intent to treat the mothers, while treatment outcomes were the kids' outcomes. In that case one patient can have more than one outcome and there are ways to deal with that statistically.

You could also throw out the kid who wasn't randomized. In practice this probably means randomly discarding one of two twins and using the outcome of the other as the only data from that twin set. I think I'd probably do this, because you're sacrificing only a small amount of power in the name of ensuring study validity.
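A minimal sketch of that random discard, assuming each child record carries a family identifier. The field names and data here are hypothetical, and fixing the seed just makes the choice reproducible so it can be reported.

```python
import random
from collections import defaultdict

def keep_one_per_family(children, rng):
    """children: list of dicts with a hypothetical 'family' key.
    Returns one randomly chosen child per family; singletons are always kept."""
    by_family = defaultdict(list)
    for child in children:
        by_family[child["family"]].append(child)
    return [rng.choice(members) for members in by_family.values()]

# Made-up cohort: families A and C are twin pairs, B is a single child
children = [
    {"id": 1, "family": "A"},
    {"id": 2, "family": "A"},
    {"id": 3, "family": "B"},
    {"id": 4, "family": "C"},
    {"id": 5, "family": "C"},
]
kept = keep_one_per_family(children, random.Random(2007))
print([c["id"] for c in kept])
```

The same function copes with triplets or non-twin siblings, since it only looks at the family identifier.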

One of the other problems is that presumably although the mother is getting treated, it is the kid that carries the diagnosis in question. Do the twin pairs all carry the same diagnosis in each kid? Or was the diagnosis sort of handwaved in one because of more severe problems in the other kid? Discarding the less-affected twin protects you from this kind of validity-affecting problem.

I guess your other option is this clustering function that ROU_X points out. However, I don't believe that these sorts of mechanistic statistical methodologies reflect a biological reality about the ways that heredity and environment affect response to a medication; they are more along the lines of what an engineer I knew used to call 'bugger factors'.
posted by ikkyu2 at 11:56 PM on September 18, 2007

If I was faced with this I would run both the analyses ikkyu2 suggests. Analysis 1, including all twins, seems more "real life" and perhaps more clinically relevant. Analysis 2, with just one of each twin pair, covers your bases with finickety reviewers (although I can't imagine how this plausibly compromises the validity of your study). It is almost certainly going to show exactly the same result as Analysis 1, but if it doesn't, you've got something really interesting to write about in the discussion section of your thesis/paper.
posted by roofus at 3:44 AM on September 19, 2007

This is one of the rare times when I disagree with ROU_Xenophobe. When you do a medication trial, you should do it according to an intention-to-treat design.

This is what happens when someone from a deeply regression-oriented world talks about experiments. We can't run them (very often), so I know little about issues of experimental design. But lots about dealing with the well-buggered data the world presents you with.

Ignore me and listen to the man in the related field. Re-randomizing by throwing away one twin makes sense. If the number of subjects is large, I wonder if it would be useful to randomly discard one twin from each pair, store the results, and then repeat another 1000 to 100000 times to build a density of results over different randomizations?
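The repeated-discard idea could be sketched like this. It assumes twins always share a treatment group (as teem clarified), so every draw still contains both groups; the field names, the data and the choice of a simple difference in means are all placeholders for whatever the real analysis is.

```python
import random
import statistics
from collections import defaultdict

def diff_in_means(children):
    """Treated mean minus control mean for one reduced data set."""
    treated = [c["outcome"] for c in children if c["treated"]]
    control = [c["outcome"] for c in children if not c["treated"]]
    return statistics.mean(treated) - statistics.mean(control)

def repeated_discard(children, n_draws=1000, seed=0):
    """Keep one random child per family, estimate, and repeat n_draws times."""
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for child in children:
        by_family[child["family"]].append(child)
    estimates = []
    for _ in range(n_draws):
        sample = [rng.choice(members) for members in by_family.values()]
        estimates.append(diff_in_means(sample))
    return estimates

# Made-up data: families A and C are twin pairs, B and D are single children
children = [
    {"family": "A", "treated": True, "outcome": 6.0},
    {"family": "A", "treated": True, "outcome": 4.0},
    {"family": "B", "treated": True, "outcome": 5.0},
    {"family": "C", "treated": False, "outcome": 2.0},
    {"family": "C", "treated": False, "outcome": 3.0},
    {"family": "D", "treated": False, "outcome": 1.0},
]
estimates = repeated_discard(children, n_draws=1000)
print(min(estimates), statistics.median(estimates), max(estimates))
```

The spread of the estimates shows how much the answer depends on which twin happened to be dropped. It is not a substitute for the standard error of any single analysis.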
posted by ROU_Xenophobe at 4:30 AM on September 19, 2007

ikkyu2 is correct, unless this isn't just an efficacy trial. If what you're looking for is how some other factors interact with a treatment (eg that it worked just as well in women), you're doing what's called a mixed-effects or random-effects model. Basically, you assert that there's a random component to the outcome based on genetics, and that twins share more than normal people. Googling brings up several pages on how to do "mixed effects SPSS".
posted by a robot made out of meat at 4:44 AM on September 19, 2007

Random-intercept and random-coefficient models are both examples of random effects models, FWIW.
posted by ROU_Xenophobe at 6:05 AM on September 19, 2007
