May 22, 2009 9:23 AM

How does statistical analysis differ when analyzing the entire population rather than a sample?

I need to do some statistical analysis on legal cases. I happen to have the entire population rather than a sample. I'm basically interested in the relationship between case outcomes and certain features (e.g., time, the appearance of certain words or phrases in the opinion, the presence or absence of certain issues).

Should I do anything different than I would if I were using a sample? For example, is a p-value meaningful in this kind of case?

If it matters, the population is large (many thousands of cases) and spans several years.
posted by jedicus to Science & Nature (9 answers total) 4 users marked this as a favorite

Often in doing statistical analysis, the goal is to make statements about the entire population. The use of smaller samples is usually necessitated by lack of data on the full population, so statistical measures are used to test how likely it is that the results from analysis of the sample are representative of the full population. When you actually have data for the full population, no such tests are necessary.

The "p-value" statistic is a measure of how likely you would be to observe a sample like yours if the parameter for the population were something particular (usually, testing whether a particular population parameter has a particular sign, or is non-zero). So if you really do have the full population of cases in which you are interested, then you need not worry about testing how representative your sample is likely to be, and the "p-value" becomes meaningless.

So the important question is whether you actually have the full universe of data in which you are interested. If you're simply trying to show that, *historically*, courts more often granted motions to dismiss when certain concepts were raised in the briefs, or that cases that took longer to reach decisions more often rendered judgment for plaintiffs, etc., then you may simply report the actual population statistics. But if you are interested in making predictions about future cases, then you may not actually have the entire population in which you are interested - rather, you may have a mere sample of *potential* or *possible* cases.

posted by dilettanti at 10:01 AM on May 22, 2009 [1 favorite]

A niggling little detail from my previous post that I need to correct:

You wouldn't get a "beta value" by running a Power test -- you would get (durr) the Power. A "beta error" is the same as a "Type II error", and its probability is 1 minus the Power. If the Power of your test is greater than 0.9, then your risk of committing the aforementioned error is less than 0.1.

posted by jpolchlopek at 10:04 AM on May 22, 2009

I guess I should also mention that goodness-of-fit tests, like R-squared in the context of linear regressions, are still relevant for incomplete models in the full-population context, because they are measures of how well your particular model works in explaining the observed data. The way such measures are calculated may change when you know you have the full population, but they are still relevant and meaningful.
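As a rough illustration of R-squared used as a purely descriptive measure, here is a minimal sketch for a one-predictor least-squares line. The data and the names `r_squared`, `years` and `months` are made up for the example; nothing here comes from the original poster's cases.

```python
from statistics import fmean

def r_squared(xs, ys):
    """R^2 of a one-predictor least-squares line fitted to (xs, ys)."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares over every observation we have.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: filing year (offset) vs. average months to decision.
years = [0, 1, 2, 3, 4, 5]
months = [14.0, 13.1, 12.5, 11.0, 10.4, 9.6]
print(f"R^2 = {r_squared(years, months):.3f}")
```

Nothing in this calculation involves sampling: it only says how much of the observed variation the line accounts for.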

posted by dilettanti at 10:19 AM on May 22, 2009

There are two sources of randomness (or things people think are random) in samples.

Some people (esp. people who work on complex samples, e.g., clustered stratified multistage PPS/systematic samples etc.) think mostly about who got into the sample as random. There is a true, finite population and there is a relationship between variables in that population. If your interest is to say what that relationship is, then when you have the full population, *nothing is random*. When you have a subset of that, the random thing is *who you selected to be in the subset*. These are sample-design based statistics.
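A small simulation may make the design-based view concrete. This is only a sketch: the population, the feature, the win rates and the helper `win_rate_gap` are all invented for illustration, and the random number generator is used just to build the made-up cases and to pick subsets.

```python
import random

def win_rate_gap(cases):
    """Plaintiff win rate with the feature minus win rate without it."""
    with_f = [won for has_f, won in cases if has_f]
    without_f = [won for has_f, won in cases if not has_f]
    return sum(with_f) / len(with_f) - sum(without_f) / len(without_f)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical finite population of 5,000 cases: (has_feature, plaintiff_won).
population = []
for _ in range(5000):
    has_f = rng.random() < 0.4
    won = rng.random() < (0.55 if has_f else 0.45)
    population.append((has_f, won))

# With the full population the gap is one fixed number.
full_gap = win_rate_gap(population)

# With subsets of 300 cases, the gap varies only because of who was selected.
sample_gaps = [win_rate_gap(rng.sample(population, 300)) for _ in range(200)]

print(f"population gap: {full_gap:.3f}")
print(f"sample gaps range from {min(sample_gaps):.3f} to {max(sample_gaps):.3f}")
```

The spread across the 200 subsets is the thing a design-based standard error or p-value describes; the population gap itself has no such spread.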

Almost never, if you really think about it, do you have the whole population of interest. You may be interested only in speaking to what happened in those court cases you have data on, but probably you're interested in speaking to what would happen in cases scheduled tomorrow, or scheduled in a court beyond the scope of your data but similar to ones that are in your scope. In that case you have a sample of the larger set {cases in jurisdiction X from five years ago to five years from now}.

On the other hand, you can think about the outcome as random. If there are lots of determinants of the outcome that you can't measure, it's not so bad to think about it that way. If you are thinking about the outcome as random, now there are two sources of randomness: the design and the events. In this case people mostly think about modeling both the sampling and the correlation between events (if it exists).

posted by a robot made out of meat at 10:30 AM on May 22, 2009 [2 favorites]

Apologies, but I would say that most statisticians would question the utility of a post-hoc power calculation in a scenario like this. Power calculation plays its role prospectively, in study design and implementation. With a data set that has already been obtained retrospectively and whose size is externally limited, a power calculation done after the study largely loses any meaning beyond what you already learn from hypothesis tests whose confidence intervals include the null. When that occurs, you already know you were "underpowered" to detect the point effect size you found at the chosen alpha. Post-hoc power always falls as the p-value and the width of the confidence interval rise. In this context, one can use the results from the initial cohort to perform a so-called "reverse power" calculation, which would yield an estimate of how many additional data points might be required to narrow the confidence intervals sufficiently. But with a dataset this large and no prospect of additional data points, the question is largely academic, and it would typically arise only when the actual magnitude of an effect size of interest becomes exceedingly small.

posted by drpynchon at 11:31 AM on May 22, 2009

What you seem to be saying is that a) since the original poster has a complete or near-complete set of data, and b) since the P-value will be large, any post hoc power calculation (i.e. run a Power test on your "sample" size and results) is pretty much pointless.

In the Google search you linked to, one of the articles states that post hoc power calcs are "irrelevant and misleading", though I'm not sure why.

Apologies, myself, but I'm just not clever enough to parse your sentence into something I can understand. Can you clarify for me?

posted by jpolchlopek at 11:44 AM on May 22, 2009

A power calculation asks "if the effect size were A, with a design S, what's the probability that I will get a significant result?". If you have S, calculate A, and it's not significant then a power calculation will always tell you that you needed more cases. But that isn't true; the effect might really not be there.
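To make that question concrete, here is a minimal sketch of a prospective power calculation. It assumes the simplest possible setting, a two-sided one-sample z-test with known standard deviation, and the function name and arguments are my own choices for illustration.

```python
from statistics import NormalDist

def power_two_sided_z(effect, sd, n, alpha=0.05):
    """Probability that a two-sided one-sample z-test rejects when the true mean shift is `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect / (sd / n ** 0.5)  # assumed true effect in standard-error units
    # Chance of landing beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Effect size A = 0.2 standard deviations, under two candidate designs S.
for n in (100, 400):
    print(f"n = {n}: power = {power_two_sided_z(0.2, 1.0, n):.3f}")
```

The effect size goes in as an assumption chosen before looking at the data; the calculation says nothing about whether that effect really exists.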

posted by a robot made out of meat at 12:37 PM on May 22, 2009

Sorry for the second post and bit of derail. With respect to power analysis, post-hoc calculations (using the variance and effect size estimates from your already existing sample) do nothing other than restate the information that your p-value and confidence intervals provide. If p = 0.05 exactly for a given hypothesis test, then the calculation for the resulting point estimate of the effect size will suggest 50% power at an alpha = 0.05. If p > 0.05, for the same point estimate, the power will be even less than 50%. By the same token and perhaps more appropriately, for any test with a p > 0.05 one can draw the conclusion that with 95% confidence, *if an effect does exist* the magnitude of such an effect is less than the bounds of your 95% confidence interval. The greatest utility of power analysis is really in justifying the conduct and resource allocation of a trial in the first place while in the planning stages. The NIH isn't going to give me funding for a project if the power of such a study to yield a meaningful result is poor. But if the data is already collected and analyzed that question is moot. For further explanation this link and the references at the bottom are a decent start.
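The claim that post-hoc power merely restates the p-value can be checked with a short sketch. It assumes a two-sided z-test and treats the observed effect as if it were the true one; `observed_power` is a name made up for this example.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test when the observed effect is taken as the true effect."""
    std_normal = NormalDist()
    z_obs = std_normal.inv_cdf(1 - p_value / 2)   # test statistic implied by the p-value
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # rejection threshold
    return std_normal.cdf(z_obs - z_crit) + std_normal.cdf(-z_obs - z_crit)

for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

The function takes nothing but the p-value and alpha, which is the whole point: under these assumptions the "power" is just a transformation of p, and it sits at roughly 50% when p equals alpha and lower for any larger p.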

Now to the OP, theoretically, even if you do have all the available data points in existence at the current time, one way to look at your data set would be as a sample from a larger time or geographic frame (for example one that includes future cases not yet in existence). Of course that virtually guarantees sampling bias in any generalizations one might make.

Were I in your shoes I might consider taking your large sample and breaking it up into parts. A sample this large allows for the creation of a prediction model using a variety of statistical techniques (logistic regression, classification and regression trees, etc.) in a reasonably sized subset you might call a derivation cohort. This is where things like p-values might be more meaningful (in the model selection process combined with goodness-of-fit statistics perhaps), after breaking up your total cohort. The advantage of looking at a sub-sample is that once you create a model, you now have a second built-in cohort for external validation to which you can apply your model and perform things like ROC analysis yielding concordance-statistics. This will tackle the question of not just whether associations are statistically significant (p-values) in the particular derivation cohort but how well such associations do in predicting outcomes in other cohorts.
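Here is a bare-bones sketch of that derivation/validation workflow, using only the Python standard library. Everything in it is an assumption made for illustration: the simulated cases, the two binary features, the 70/30 split and the plain gradient-descent fit. In practice you would reach for a statistics package rather than hand-rolled fitting.

```python
import math
import random

def predict(weights, x):
    """Predicted probability of the positive outcome for one feature vector."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, steps=300, lr=0.5):
    """Fit logistic regression weights (bias first) by plain batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grads[0] += err
            for j, xj in enumerate(x):
                grads[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

def c_statistic(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical cases: two binary features and a binary outcome.
rng = random.Random(1)
cases = []
for _ in range(1000):
    x = [float(rng.random() < 0.5), float(rng.random() < 0.3)]
    p_true = 1.0 / (1.0 + math.exp(-(-0.5 + 1.2 * x[0] - 0.8 * x[1])))
    cases.append((x, int(rng.random() < p_true)))

# Split once: build the model on the derivation cohort, score the validation cohort.
rng.shuffle(cases)
derivation, validation = cases[:700], cases[700:]
weights = fit_logistic([x for x, _ in derivation], [y for _, y in derivation])
scores = [predict(weights, x) for x, _ in validation]
validation_c = c_statistic(scores, [y for _, y in validation])
print(f"validation c-statistic: {validation_c:.3f}")
```

A c-statistic near 0.5 on the validation cohort would mean the model ranks cases no better than chance, however significant its coefficients looked in the derivation cohort.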

If your goal is to create a quantitative prediction model that you might ultimately want to use for future legal cases, that is how I'd go about things. For an example from the medical literature, see this study on heart failure admissions. If your goal is only to describe what went on in the total population of cases you have and nothing else, then throw out the hypothesis-testing altogether and focus on a careful exploration of descriptive and goodness-of-fit statistics for any models you propose.

posted by drpynchon at 3:34 PM on May 22, 2009 [2 favorites]

This thread is closed to new comments.

The lesson we got is that you almost never have the true population, you always have a sample (even if it is a VERY large sample).

I would run my analysis as if I was still using a sample, and be sure to run Power tests afterwards. It's entirely possible that, even given your very large "sample", you won't be able to get results with a Beta value of greater than 0.90 (i.e. your risk of committing a Type II error [unable to detect change even when a change is present] is less than 10%).

posted by jpolchlopek at 10:00 AM on May 22, 2009 [1 favorite]