A 'statistically valid sample' is ... ?
February 5, 2015 2:49 PM   Subscribe

I'm developing advice related to a set of standards, specifically about a requirement to review a 'statistically valid sample of assessments' to determine whether an assessment process is valid. Most information I've been able to find relates to surveys and similar activities, which doesn't quite fit this scenario. The questions I keep running into concern what the margin of error and confidence level should be in a scenario that isn't a survey or similar.

The sample is of assessment decisions in vocational training - a decision being whether or not a person is competent at a defined set of tasks (defined in a unit of competency). The outcome is binary (either competent or not) for each unit.

The activity in question is to determine the validity of the assessment process by reviewing a random sample of assessment decisions and either confirming or refuting each original decision reviewed, based on the evidence used to make the decision. The review is conducted by a person or team with at least the same capability (in both subject matter and assessment skills and knowledge) as the person who made the original decision. The outcome of the validation activity is used to determine whether the assessment process is valid overall.

There are lots of 'sample size generators' online and I have been using this one as an example. The issue I'm facing is how to answer questions about what the margin of error and confidence interval should be.

Discussions with a number of people who are experienced and knowledgeable about this validation process indicate that, for a population size (ie number of assessments for a specific unit) of 100, a sample of 30% is 'about right'. Using the calculator linked, a margin of error of 15% and confidence interval of 95% produces this sample.
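For reference, the arithmetic behind calculators like this appears to be the standard sample-size formula for estimating a proportion, with a finite population correction. Here's a rough sketch in Python (assuming the conservative default of p = 0.5, which most such calculators use; the z-scores are the usual normal-approximation values):

```python
def sample_size(population, margin_of_error, confidence=0.95, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # z-score for the chosen confidence level (1.96 at 95%)
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
    # sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # shrink for a small, finite population
    n = n0 / (1 + (n0 - 1) / population)
    return round(n)

print(sample_size(100, 0.15))   # 30 -- the 'about right' figure above
print(sample_size(1000, 0.15))  # 41 -- the sample barely grows with population size
```

Note how weakly the required sample depends on population size once the population gets large - that's the finite population correction at work.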

The concept that I have in my mind (for this scenario only) is that the margin of error relates to expectations that, given the process itself is validated before implementation (or should be), the assessment process should produce valid results assuming it is implemented correctly (ie the outcome is not purely random) and that, barring any specific risk factors, the confidence interval can be left at the default.

I know this is all very unscientific and vague. The requirement for the 'statistically valid sample' is new and I have to try and explain (not to mention understand myself) what is an 'acceptable' sample in this context. I think my actual question is 'in the scenario described above, what do the margin of error and confidence interval actually represent?'
posted by dg to Science & Nature (7 answers total) 5 users marked this as a favorite
Best answer: I'm not 100% clear, but it sounds like what you're doing is a test/re-test analysis to measure the validity of your testing process. Since your data is binomial, you could measure this with a Cohen's Kappa coefficient. Is that what you are doing? Or do you have some other metric or test that you are using to evaluate your validity?

If I am reading it correctly, the sample size calculator you provided is giving the sample size needed to estimate what percent of people in a population would have a given value (yes/no). It doesn't appear related to test/re-test.

I have not found an on-line calculator that will provide sample sizes for Cohen's Kappa, though some stats packages do include such a function. If you have a stats package, maybe it does? If not, here is a site which provides a table with some sample sizes based on how much error you can tolerate in your measurement of Kappa, and how big you expect Kappa to be. If you want to get into this further, here is a paper that discusses this in more depth: [PDF].
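If it helps, the Kappa statistic itself is easy to compute once you've tabulated the agreements and disagreements between the two raters. A minimal sketch (the counts here are made up for illustration):

```python
def cohens_kappa(both_yes, yes_no, no_yes, both_no):
    """Cohen's kappa for two raters making binary (yes/no) decisions."""
    total = both_yes + yes_no + no_yes + both_no
    # observed agreement: proportion of items where the raters agree
    p_observed = (both_yes + both_no) / total
    # chance agreement, from each rater's marginal yes-rate
    rater1_yes = (both_yes + yes_no) / total
    rater2_yes = (both_yes + no_yes) / total
    p_chance = rater1_yes * rater2_yes + (1 - rater1_yes) * (1 - rater2_yes)
    return (p_observed - p_chance) / (1 - p_chance)

# e.g. out of 100 reviewed decisions: 45 agreed-competent,
# 45 agreed-not-competent, and 5 disagreements each way
print(cohens_kappa(45, 5, 5, 45))  # 0.8
```

Kappa of 1 means perfect agreement, 0 means no better than chance - the tables linked above tell you how big a sample you need to estimate it to a given precision.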
posted by agentofselection at 6:12 PM on February 5, 2015

The same author as the first link I posted also posted this, which provides an even easier-to-use article and table.

Incidentally, I think the number of people evaluated total is a bit of a red herring here. If what you want to do is measure how much two independent evaluators agree with each other, then the total number of people evaluated by evaluator 1 (but not evaluator 2) doesn't matter.
posted by agentofselection at 6:29 PM on February 5, 2015 [1 favorite]

Response by poster: Thanks - that provides some useful background. I don't think I've explained very well what I'm trying to find out, though.

The problem is how to determine what size random sample should be taken from the decisions made so that the results of reviewing those decisions (ie the sample) can be applied to all of the decisions sampled from (ie the population). So, if there have been 100 decisions made for a particular unit, how many of those decisions need to be reviewed in order to apply the outcome to all 100 decisions? If there are 1,000 decisions, how many need to be sampled, etc?

I think it matters (but I could well be wrong) that there are only two outcomes for the original decision (candidate competent or not) and only two outcomes from the validation process (decision valid or not), and that the outcome of the validation process is biased towards a 'decision valid' outcome, as there are significant factors in the assessment process that make a valid decision much more likely than an invalid one.

In the sense that the validation process is, to some extent, a re-assessment based on the same evidence, this is a test/re-test process. However, that side of things is taken care of in the validation process itself. What I'm trying to develop is a sound method of determining the sample size itself. Based on moderation with enough people, across enough population sizes, to be confident, the online tool I linked does that if the variables (margin of error and confidence interval) are set correctly. I'm not confident, though, that those two variables actually do, in this context, what their labels say they do.
posted by dg at 7:32 PM on February 5, 2015

Best answer: A few things -

- This is a non-statistical sample that you're trying to get. To have a statistical sample, you would have to use software that calculates exactly which items you need to look at in order to give you a statistically accurate result.

- Confidence level - I think 95% is reasonable. You always have sampling risk when it comes to testing and validating something using a sample. A 95% confidence level is very good.

- Margin of error - 15% seems rather high to me, but it's fine if that's what you're comfortable with. As has been pointed out, are the answers you're testing split roughly evenly between competent and not, or do they lean heavily one way? Note the suggestion that if the answers are somewhat evenly split, you should use a lower margin of error.

- Number of items - remember that this depends on how you're doing your testing. For example, if you have 100 assessments, each with 50 questions, and you plan to treat each sampled item as one question in one assessment, then your population size is 5,000. However, if you plan to treat the entire assessment as one sample item, then your population is 100. You didn't ask about this, but I wanted to cover it in case you weren't sure how this part works.

I hope that I've helped.

posted by Georgia Is All Out Of Smokes at 8:26 PM on February 5, 2015

Response by poster: Thanks :-)

Margin of error - yes, it would normally be expected that the vast majority would be 'competent'. Number of items - each assessment decision is one item, not the individual assessment components.
posted by dg at 8:40 PM on February 5, 2015

Best answer: So Uncle Bob's Training School is doing training stuff and they do assessment stuff to say that 80% of the people they train to weave baskets underwater are in fact competent at it. Other people are reviewing the same assessment evidence and maybe some more in order to keep Uncle Bob honest and they found that 60% are competent in a sample of size N. What you want to know is, how big a sample do the Double Checkers need to take to be confident in their work, because if they're obviously wrong in some way then bureaucrats will come screaming out of the sky and haul everyone off in their horrible talons. Is that about right?

The concept that I have in my mind (for this scenario only) is that the margin of error relates to expectations that, given the process itself is validated before implementation (or should be), the assessment process should produce valid results assuming it is implemented correctly (ie the outcome is not purely random) and that, barring any specific risk factors, the confidence interval can be left at the default.

You are overthinking this and underthinking this at the same time. Which is normal.

First, I think you're mixing up confidence levels and intervals. A confidence interval is a probabilistic statement about a population made from a sample, and it conveys the same information as a margin of error (the margin of error is the interval's half-width). A confidence interval might say that the population value is 60% +/- 15%, or 45-75%. Or, with a smaller margin of error and a tighter interval, 60% +/- 8%, or 52-68%. A confidence level is how broad a confidence interval you pick -- you might pick a 90% one, or a 95% one, or a 99% one. Which confidence level is the right one is not a statistical question -- it depends on what's at stake, and how much things cost, and so on.

The margin of error relates to sampling error. That's it and that's all. There's some true population value out there for the proportion of Uncle Bob's graduates that really and truly are competent basketweavers. Assuming you treat the results of the Double Checkers as gospel truth, you can only know that population value by double checking everyone. So you take a sample. The margin of error is just a statement about which population values could reasonably lead to the sample you actually observed, and which population values are so far away from the sample value that it would be almost impossible for the sample to result from them. So if you find 60% +/- 15%, that's just saying that a true population value of 55% or 50% or 46% might reasonably lead to the sample of 60% you observed, but a population value of 44% or 40% would be sufficiently unlikely to produce that sample of 60% that you can ignore that possibility for the time being. And, there, "might reasonably lead" depends on what confidence level you choose.
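To put numbers on that: for a yes/no proportion, the margin of error drops straight out of the normal approximation. A quick sketch (the sample size of 43 is picked purely for illustration, to land near the 15% figure from the thread):

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# 60% found competent in a sample of 43 reviewed decisions, at 95% confidence
moe = margin_of_error(0.6, 43)
print(f"60% +/- {moe:.0%}")  # 60% +/- 15%
```

Notice that the margin shrinks with the square root of the sample size, so halving the margin of error takes roughly four times the sample.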
posted by ROU_Xenophobe at 9:24 PM on February 5, 2015

Response by poster: Thanks ROU_Xenophobe, you've pretty much hit the nail on the head (except 'underwater knitting' is the usual nonsense term we use for units ;-)). I did suspect I might be under and/or over-thinking this all along. That reminds me that I need to sharpen my talons.

So, in context, it seems that the confidence level can be matched to the level of risk involved: for a unit on answering the telephone, the potential consequences of someone being deemed competent when they aren't are relatively low (so a lower confidence level could be applied), whereas the consequences of someone being deemed competent at operating heavy mining equipment when they aren't are much greater (so a higher confidence level would be appropriate). Also in context, the higher the proportion of outcomes that are the same (eg 90% competent), the higher the margin of error you can tolerate, although possible consequences should be factored in here as well.

I think?
posted by dg at 1:26 AM on February 6, 2015

This thread is closed to new comments.