What's the right method for this?
April 5, 2015 8:53 PM

I am undertaking a fun statistical project, and I need help...

I'm working on a fun little side project that is analyzing the outcomes of a reality show (RuPaul's Drag Race). Competitors lipsync against each other one-on-one in a 'battle' and there is one winner and one loser. The loser goes home, so a competitor may sing 1-4 times depending on their skill.

I have a data set with various characteristics of the performances (e.g. "high energy performance", "does splits"). Most of the variables are yes/no with a couple of integers (e.g. "number of acrobatic tricks"). I've already done logistic regression to identify absolute characteristics of winners and losers, but now I want to look at how various factors interact within the battles themselves (for example, just looking at the data suggests "high energy" tends to win when competing against "low energy").

What's the right method to go about looking for these interactions? I'm working in R, so R-specific answers would be awesome, but really just knowing which method is the correct one should be good enough here.
posted by zug to Science & Nature (8 answers total) 6 users marked this as a favorite
 
So you are trying to see if "high energy" + "does splits" has a better outcome than "high energy" + "rock song"? Are you looking at which combinations are the most winning?

Could you specify at how various factors interact with each other? (What a fun project!)
posted by ichomp at 9:26 PM on April 5, 2015


Can you clarify what you're using as your unit of analysis? Am I understanding that you have one case per performer per battle? If so, you have a dependency problem: your units are not independent of one another.

Putting that aside (and really, you shouldn't put that aside, but let's until you can clarify what the unit of analysis is), what you want are interaction effects. Take your "high energy" variable and multiply it by your "high energy opponent" variable. Plug all three variables into your model. The result is that the coefficient for high energy is now the association with winning only for performers that had low-energy opponents. To find the association for high-energy performers with high-energy opponents, you add the coefficient for high energy and the coefficient for the interaction variable (the variable you made by multiplying).

Note that a p value of less than .05 for the interaction variable means that the difference between the effect of high energy against a low-energy opponent and its effect against a high-energy opponent is significant. It does not mean that a high-energy performance against a low-energy performance has an effect greater than 0. Do not let people used to working with ANOVAs help you interpret regression interaction effects. They're different.
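
A minimal sketch of how the coefficients from the multiplication approach described above would be read. It is in plain Python only for illustration; the same arithmetic applies to the output of a logistic regression in R or Stata. The coefficient values and the names `B_HIGH`, `B_OPP_HIGH`, `B_INTERACTION` and `log_odds` are invented for this example and are not estimated from any real data.

```python
import math

# Hypothetical coefficients on the log-odds scale (assumed, not fitted).
B_INTERCEPT = -0.2
B_HIGH = 0.9          # high energy, when the opponent is low energy
B_OPP_HIGH = -0.5     # facing a high-energy opponent, when you are low energy
B_INTERACTION = -0.6  # extra shift when both performers are high energy


def log_odds(high, opp_high):
    """Predicted log-odds of winning for one performer."""
    interaction = high * opp_high  # the variable made by multiplying
    return (B_INTERCEPT
            + B_HIGH * high
            + B_OPP_HIGH * opp_high
            + B_INTERACTION * interaction)


# Effect of high energy = change in log-odds when high goes from 0 to 1,
# holding the opponent's energy fixed.
effect_vs_low = log_odds(1, 0) - log_odds(0, 0)    # B_HIGH alone
effect_vs_high = log_odds(1, 1) - log_odds(0, 1)   # B_HIGH + B_INTERACTION

print("vs low-energy opponent: ", round(effect_vs_low, 3),
      "odds ratio", round(math.exp(effect_vs_low), 3))
print("vs high-energy opponent:", round(effect_vs_high, 3),
      "odds ratio", round(math.exp(effect_vs_high), 3))
```

The interaction coefficient is the gap between the two effects, which is why its p value speaks only to that gap and not to whether either effect differs from zero.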

Now back to this dependency thing. Here's the problem: if a high-energy performance multiplies your odds of winning by 1.2, then of course this would only matter against a low-energy performance; otherwise your opponent's odds of winning would also be multiplied by 1.2, which would put you back on par. Your problem is that if you have performers as the unit of analysis, there's nothing in the model that constrains the two opponents' predicted probabilities of winning in any given battle to sum to 1.
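
A rough numeric sketch of that problem, using the 1.2 odds multiplier from the paragraph above. The baseline of even odds for a low-energy performer is an assumption, and the function names are invented. The second model, which predicts from the difference between the two performers' energy, is just one possible way to treat the battle as the unit; it is shown only to contrast the behavior and does not address the repeated-performer issue.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


B_HIGH = math.log(1.2)  # log of the 1.2 odds multiplier
B0 = 0.0                # assumed baseline: low energy means even odds


def p_win_performer(high):
    """Performer-level model: ignores who the opponent is."""
    return sigmoid(B0 + B_HIGH * high)


def p_win_battle(high_a, high_b):
    """Battle-level model: depends only on the difference between the two."""
    return sigmoid(B_HIGH * (high_a - high_b))


# Sum of the two opponents' predicted win probabilities in one battle.
performer_sum_high_vs_low = p_win_performer(1) + p_win_performer(0)
performer_sum_high_vs_high = p_win_performer(1) + p_win_performer(1)
battle_sum_high_vs_low = p_win_battle(1, 0) + p_win_battle(0, 1)
battle_sum_high_vs_high = p_win_battle(1, 1) + p_win_battle(1, 1)

print("performer-level sums:", performer_sum_high_vs_low,
      performer_sum_high_vs_high)
print("battle-level sums:   ", battle_sum_high_vs_low,
      battle_sum_high_vs_high)
```

In the performer-level version the sums drift above 1, while the difference-based version gives sums of 1 by construction, with two high-energy performers back at even odds.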

You need to find a way to make the dyads (battles) the unit of analysis, but even then, dyads aren't independent because the same performers are in multiple dyads. So you need dyads and some kind of multi-level model. Most multi-level models would include a single random or fixed effect per case, though. Crossed-effects are close, but dyads are different because crossed effects assume two non-overlapping sets of cases. With dyads the "first performer" and "second performer" are drawn from the same set. So, I think multi-level models for dyadic data are what you want.
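
To make the data structure concrete, here is a small sketch of reshaping one-row-per-performer-per-battle data into one row per battle (dyad). All rows, performer labels, and column names are invented for illustration. It only shows the reshaping and why the dyads are still not independent; it does not fit a multi-level model.

```python
from collections import Counter, defaultdict

# Invented data: one row per performer per battle.
rows = [
    {"battle": 1, "performer": "A", "high_energy": 1, "won": 1},
    {"battle": 1, "performer": "B", "high_energy": 0, "won": 0},
    {"battle": 2, "performer": "A", "high_energy": 1, "won": 0},
    {"battle": 2, "performer": "C", "high_energy": 1, "won": 1},
    {"battle": 3, "performer": "C", "high_energy": 0, "won": 1},
    {"battle": 3, "performer": "D", "high_energy": 0, "won": 0},
]

# Group the two performer rows belonging to each battle.
by_battle = defaultdict(list)
for row in rows:
    by_battle[row["battle"]].append(row)

# One row per battle; which performer counts as "first" is arbitrary.
dyads = []
for battle_id in sorted(by_battle):
    first, second = by_battle[battle_id]
    dyads.append({
        "battle": battle_id,
        "first": first["performer"],
        "second": second["performer"],
        "energy_diff": first["high_energy"] - second["high_energy"],
        "first_won": first["won"],
    })

# The same performers recur across dyads, so dyads are not independent.
appearances = Counter(row["performer"] for row in rows)
repeat_performers = sorted(p for p, n in appearances.items() if n > 1)

# A performer can land in either slot: one pool, not two separate sets.
slots = defaultdict(set)
for d in dyads:
    slots[d["first"]].add("first")
    slots[d["second"]].add("second")
both_slots = sorted(p for p, s in slots.items() if len(s) == 2)

for d in dyads:
    print(d)
print("in more than one battle:", repeat_performers)
print("seen in both slots:", both_slots)
```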

Once you get your regressions in that form to correct for the dependencies, use the interaction variables as described above.
posted by If only I had a penguin... at 9:45 PM on April 5, 2015 [3 favorites]


Have you already seen this?
Lipsyncing for your life: a survival analysis of RuPaul’s Drag Race

The author tries to predict survival time or how long contestants survive on the show, which they do by winning battles.

Could you describe what you are trying to do differently from this application of survival analysis?
posted by needled at 5:31 AM on April 6, 2015


Response by poster: I have seen that! That is an analysis of contestant characteristics; I'm looking at lipsync characteristics, but it's in the same vein as what I'm thinking.

I should note that this kind of analysis is far enough out of my wheelhouse that I'm happy to take suggestions. Perhaps the best unit of analysis would be "song tempo" - the winning strategy is highly dependent on whether it's a slow song or a fast song. Another possible unit of analysis could be the lipsync battle itself.

ichomp basically has it. I don't care so much about how individuals do here (that's a well-trodden chunk of analysis). I'm really looking at the characteristics of lipsync performance A vs lipsync performance B, and what combination of factors wins the most a) in each song type and b) strategy vs strategy (high energy vs low energy, for example).
posted by zug at 9:17 AM on April 6, 2015


I don't use R, but I think you've got two options here. I haven't seen the show, so I'm not clear if this is multi-level data or not (is "round" a level two variable?). If it is multi-level data, you'd be looking at using a random slope, random intercept multilevel model as If only I had a penguin has mentioned. Win-lose would be a binary outcome. Hopefully you can start to test out a few models from there.

If it's single level data, Stata can generate an interaction variable, which you can then test using a likelihood ratio test. No idea how to do that in R, sorry, but hopefully that will give you something to google. It'll take you quite a while to test every possible combination of variables to create a final model.

My main query is how you are defining your other variables - are they all binary? Are some categorical or continuous? Based on what? (You don't have to answer that, but I'd be more worried about the model failing due to miscategorised data than anything else).
posted by tinkletown at 9:42 AM on April 6, 2015


Zug: It sounds like "tempo" would be a variable, not a unit of analysis. The more I read, the more I think the unit of analysis is the battle.

Tinkletown: My understanding is that the two levels are contestants and battles. Each battle has two contestants, but the same contestants are in multiple battles. Because each battle is associated with two contestants, regular random effects won't work. Because all the contestants are part of one pool (i.e. the pairings represented in graph form would not constitute bipartite data), regular crossed effects won't work. That's why you want the method for dyads. I don't use R either, but the method described above (make a new variable by multiplying) will work in any stats program, and that's the way it was done in Stata before Stata brought in the # and ## operators.
posted by If only I had a penguin... at 11:17 AM on April 6, 2015 [2 favorites]


Ah, that makes sense! I should probably watch the show before jumping in :)
posted by tinkletown at 11:52 AM on April 6, 2015


Response by poster: Thanks for all the help, folks. I'll be mucking with this more this week, but multi-level dyads it is :)
posted by zug at 12:34 PM on April 6, 2015

