Best or most appropriate statistical method to use.
January 27, 2015 6:35 AM
I have a statistics and/or probability question and the last time I took a statistics class Vanilla Ice and Andrew "Dice Clay" were multi-millionaires.
I am not looking for a problem to be solved, I am asking what statistical technique should I use to determine if a time series of data is due to randomness or not.
For example:
Let’s say I have a list of 10 College Football Teams (names, mostly made up.) For each team, and for each year over the past five years, I have the total number of points scored in the season. So the raw data looks something like this. Data is arranged alphabetically by team name.
Bears – Year 1 points scored = 677; year 2 = 654; year 3 = 691; year 4 = 688; year 5 = 692.
Cats – Year 1 points scored = 692; year 2 = 643; year 3 = 656; year 4 = 650; year 5 = 661.
Ducks – Year 1 points scored = 688; year 2 = 692; year 3 = 678; year 4 = 684; year 5 = 696.
The remaining 7 example teams would have data arranged in a similar manner.
Falcons, Gorillas, Killers, Mother Boys, Orange Crush, Peacocks, Vandals.
------------------
To repeat and rephrase my question, what statistics or probability technique should be used so I can be able to say, Team X’s performance, is likely not due to randomness, and therefore likely due to something else?
If you really enjoy this type of stuff and would like to go through the steps, that would be great, but there is no obligation.
Thank you
I'm not particularly clear on your question either. Mostly time series data is used for forecasting (and I'm not sure you have enough of it to do so reliably). Here's an example about predicting baseball batting averages.
Are you trying to see if any one year is a significant outlier? One way is to use Grubbs's test. Here is a calculator. I inputted the Bears scores and none were a significant outlier.
But I think you need another variable if points scored is your measure of performance. For a silly example, what percentage of the team wore lucky underwear in each year? Then you'd be able to test whether the performance had any correlation to lucky underwear or was due to some other factor. It's due to something, unless you are literally picking numbers out of a hat and there are no real-world factors. But I don't see why the time series is important in that case unless you're postulating that (for example) whether or not you wore lucky underwear last year will impact future years.
Anyway, we need more info on what "random" means to you.
posted by desjardins at 7:45 AM on January 27, 2015 [1 favorite]
I had yet a third interpretation of your question: I assumed you were trying to find out if teams had a significant trend line (i.e. is this team's rise or fall just random, or are they actually going up or down). Your basic tool for that would be a Poisson or negative binomial regression (because your DV is a count). How exactly to model it would depend on what you're trying to figure out more precisely:
1. Is it the case that in general teams scored higher (or lower) in later years: Put in year as a continuous independent variable. Add random effect per team.
2. If you want to know if there is something about some years that makes teams score higher or lower: Use year fixed effects (i.e. put every year but one in as a dummy variable). This will tell you if each year is significantly different from the one you left out. To tell if the ones you put in are different from each other, either run it again leaving a different one out, or run a post-estimation Wald test.
3. If you want to know if each team has a tendency to go up or down, you can run separate models for each team using method 1. If you need to know if those lines are different from one another (is team X improving more than team Y), you run a model with an interaction term for team and year as your IV. Because teams are a nominal variable, this is actually a bunch of interaction terms (one for each team but one * year). Given that you only have one observation per team per year, I'm pretty sure you wouldn't have the degrees of freedom to do this, even if you had many more teams and many more years.
Now the bad news. If that's all the data you have, it's not nearly enough data to do these things.
posted by If only I had a penguin... at 8:18 AM on January 27, 2015 [1 favorite]
Response by poster: Ok, that's why I am here and I will try to explain better.
I know I am adding a hypothetical to a hypothetical, but let's just say that the team that has the highest number of points at the end of a season is the "best team" or the "champ" in this league, win/loss records aside. (The "best" could be applied to any model where the highest total is ranked 1, the next highest is ranked 2, etc.)
Therefore, the team with the highest total points at the end of year 1, is ranked 1 for that year.
Let's just say that for the past 5 years, the Ducks have always ranked #1.
I would like to figure out if the Ducks were just lucky to string 5 years together of being the best each year, or if skill was involved.
For the time being, let's assume enough data exists to come up with an answer.
I believe what I would like to be able to say is, given the total score of each year for the Ducks, I am confident (at a level) the Ducks are skilled and being #1 for 5 years in a row was not just lucky.
So, to use an extreme example, I would think that if the Ducks always had total year end points that were 2 standard deviations higher than the average, then I could say, at a confidence level, that the Ducks are skilled.
I realize that the ideal method might depend on the number of data points I have. Let's just say for now that the only data points I have are the 10 teams and the points scored by each team for the past 5 years.
If I am making this more confusing, tell me to stop.
posted by otto42 at 8:37 AM on January 27, 2015
Not an answer to your exact question, but still likely to be of interest, is this paper [PDF]. Particularly the section "The Role of Chance." Sorry, looks like it's an image PDF, so you can't ctrl-F.
posted by If only I had a penguin... at 9:15 AM on January 27, 2015
You're still going to need to be specific about what it just being luck means. Like, if it were all luck, would that mean that every team has an equal chance of being #1 in a given year? Or does it mean something else?
If it means that all teams have an equal chance of being #1, then that's easy -- the probability of observing Team 1 of N being #1 X times out of X is just (1/N)^X, and if that probability is low then you can reject that specific null hypothesis. That doesn't mean that the team was necessarily skilled or anything though -- only that you are confident that it's NOT the case that every team had an equal chance at being #1.
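A minimal sketch of that calculation, assuming the null hypothesis is exactly "every team is equally likely to be #1 each year, independently across years." The function names are made up for illustration. Note that (1/N)^X is the chance for one team named in advance; the chance that some team, any team, sweeps all X years is N times larger.

```python
from fractions import Fraction

def prob_named_team_always_first(n_teams: int, n_years: int) -> Fraction:
    """P(one pre-specified team is #1 every year) under the equal-chance null."""
    return Fraction(1, n_teams) ** n_years

def prob_some_team_always_first(n_teams: int, n_years: int) -> Fraction:
    """P(any one team is #1 every year). The sweeps by different teams are
    mutually exclusive events, so their probabilities simply add."""
    return n_teams * prob_named_team_always_first(n_teams, n_years)

# The question's setup: 10 teams, 5 years.
p_named = prob_named_team_always_first(10, 5)
p_any = prob_some_team_always_first(10, 5)
print(p_named, float(p_named))  # 1/100000 1e-05
print(p_any, float(p_any))      # 1/10000 0.0001
```

Either number is small enough to reject the equal-chance null at the usual thresholds, which, as said above, only tells you the teams are not interchangeable.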
posted by ROU_Xenophobe at 10:25 AM on January 27, 2015
Okay, so I think you want to formulate your question like this:
You want to figure out the probability that some team got a certain score given the hypothesis that all points are distributed uniformly randomly, which seems like an application of Bayes' theorem, similar to this test to see the probability that dice are loaded.
Someone better at math than me can come up with the actual formula.
posted by empath at 11:00 AM on January 27, 2015
To repeat and rephrase my question, what statistics or probability technique should be used so I can be able to say, Team X’s performance, is likely not due to randomness, and therefore likely due to something else?
I don't understand what this means. It doesn't make any sense that a team could get a score randomly.
Do you mean that you want to test whether the variation in the scores is explained by some other factor?
posted by clockzero at 11:03 AM on January 27, 2015
I'm guessing this isn't really about sports. Maybe it's something to do with your job, and this analogy is not working because people keep thinking in terms of variables that could influence point totals: win/loss records, coaching, athleticism, etc. In real life, sports scores can't possibly be random or no one would ever want to play or watch. It's hard to get out of this mindset. So if the totals really truly could be random, is there another analogy you can think of? Maybe a game where dice are rolled over and over and the values are added up? (In which case, empath has your answer.)
posted by desjardins at 11:12 AM on January 27, 2015
I would like to figure out if the Ducks were just lucky to string 5 years together of being the best each year, or if skill was involved.
But you don't have a measure of skill here, so you can't test for that.
posted by clockzero at 11:31 AM on January 27, 2015
In real life, sports scores can't possibly be random or no one would ever want to play or watch.
Not random, but the better team doesn't always win. This is the point made in the paper I linked. If the same two teams played an infinite number of games, one would probably win more than the other. Say Team B(etter) wins 75% of the games and Team W(orse) wins 25%. The percentages would depend on how much better one team was than the other.
So if you imagine a series of games, there is some probability that the worse team wins the series. With the percentages above and a 5-game series, the probability is (.25*.25*.25*.75*.75) + (.25*.75*.25*.25*.75) + ... (every combo that gets you 3 wins for team W) + (every combo that gets you 4 wins for team W) + (.25*.25*.25*.25*.25). That's the probability that the worse team wins the series.
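That sum is a binomial tail, so it can be written out in a few lines. This is only a sketch, assuming all five games are played, each game is independent, and Team W wins each one with probability 0.25; the function name is made up.

```python
from math import comb

def prob_wins_series(p_game: float, n_games: int = 5) -> float:
    """P(a team wins a majority of n_games) when it wins each game
    independently with probability p_game."""
    needed = n_games // 2 + 1  # 3 wins out of 5
    return sum(
        comb(n_games, k) * p_game**k * (1 - p_game) ** (n_games - k)
        for k in range(needed, n_games + 1)
    )

# Team W(orse) wins 25% of individual games.
p_worse = prob_wins_series(0.25)
print(round(p_worse, 4))  # 0.1035
```

So even a team that loses three games out of four would take a 5-game series about one time in ten.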
However, it seems like the calculations would be different for looking at number of points instead of number of wins. And of course, actually calculating the probability instead of just thinking about how one would, requires the baseline probabilities (Bayesian as someone above said).
So it's not a crazy question to ask if maybe it's just a fluke that one team scores the most more often than others, but I don't know how to calculate said probability of fluke.
Also, yes, if this is an analogy, the real question is probably more useful.
posted by If only I had a penguin... at 11:41 AM on January 27, 2015
If you are asking what I think you are asking, you want to look up autocorrelation. This is a measure of whether a high result this year means next year is likely to also be high, or more likely to be low.
The tool for effects lasting more than one observation is ARIMA, also called Box-Jenkins after the inventors.
The overall topic would be stochastic processes.
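For a feel of what autocorrelation measures, here is a rough sketch of the lag-1 sample autocorrelation, using the Ducks numbers from the question. The function name is made up, and with only five observations the estimate is far too noisy to conclude anything from.

```python
from statistics import fmean

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation: how strongly each value's deviation
    from the mean tracks the previous value's deviation."""
    mean = fmean(series)
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    num = sum(dev[i] * dev[i + 1] for i in range(len(dev) - 1))
    return num / denom

ducks = [688, 692, 678, 684, 696]  # Ducks, years 1-5, from the question
r1 = lag1_autocorrelation(ducks)
print(r1)  # positive: high years tend to follow high years; negative: the opposite
```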
posted by SemiSalt at 2:03 PM on January 27, 2015
This thread is closed to new comments.
Another aside is that absolute number of points scored is a questionable metric of "performance", especially comparing across years.
I would probably either standardize (z-score) within each year or rank the teams each year. And being someone who brute-forces statistics by simulation, I'd probably set up some sort of random shuffle procedure to get a null distribution for number of points per team per season and see how the rankings shake out that way. Might not have enough data to shuffle; might need to see what the real distribution looks like and pull from a random distribution following the same parameters.
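A rough sketch of that standardize-then-shuffle idea, using only the three teams whose numbers appear in the question (the other seven are not given, so this is illustration only). The function names, the test statistic (a team's mean within-year z-score), and the seed are my own assumptions, not anything from the thread. The null here is that within each year the scores are assigned to teams at random.

```python
import random
from statistics import fmean, pstdev

# Only the three teams with data in the question; five seasons each.
scores = {
    "Bears": [677, 654, 691, 688, 692],
    "Cats": [692, 643, 656, 650, 661],
    "Ducks": [688, 692, 678, 684, 696],
}

def zscores_by_year(scores):
    """Standardize each season separately so years are comparable.
    Assumes no season has every team on the same score (pstdev > 0)."""
    teams = list(scores)
    n_years = len(scores[teams[0]])
    z = {team: [] for team in teams}
    for year in range(n_years):
        season = [scores[team][year] for team in teams]
        mu, sd = fmean(season), pstdev(season)
        for team in teams:
            z[team].append((scores[team][year] - mu) / sd)
    return z

def shuffle_p_value(scores, team, n_shuffles=10_000, seed=0):
    """Share of random within-year reassignments in which the chosen team's
    mean z-score is at least as high as the one actually observed."""
    rng = random.Random(seed)
    z = zscores_by_year(scores)
    n_years = len(z[team])
    seasons = [[z[t][year] for t in z] for year in range(n_years)]
    observed = fmean(z[team])
    hits = 0
    for _ in range(n_shuffles):
        # Shuffling a season hands this team one of that season's scores at random.
        simulated = fmean([rng.choice(season) for season in seasons])
        if simulated >= observed:
            hits += 1
    return hits / n_shuffles

p_ducks = shuffle_p_value(scores, "Ducks")
print(p_ducks)
```

A small value would suggest the Ducks sit higher in the yearly pack than random assignment would explain. With three teams and five seasons there are only 243 possible reassignments, so the null distribution is very coarse, which is the "might not have enough data to shuffle" problem.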
posted by supercres at 7:00 AM on January 27, 2015 [2 favorites]