Calculating the probability of a scenario that might happen by chance
May 8, 2015 10:37 AM   Subscribe

I have to validate some input for a database, and I want to present the user with a mathematically accurate estimate of the percentage of data that they entered which may be invalid. The problem is that valid data looks like invalid data 10% of the time.

I'm trying to validate UPCs. The 12-digit UPCs you see on a normal barcode always end with a check digit, which is obtained using a simple formula with the other 11 digits as input. These UPCs also always start with one or two zeroes, which will most likely be stripped if they've ever seen Excel at any point (so I can't rely on length to determine if a check digit is present). I will have to prepend zeroes to the input so that each UPC is 11 digits. I need to store the UPCs without their check digits.

The question is, what is the probability that the user has given me UPCs with check digits? I can calculate what the check digits would have been and compare to what the inputted UPCs actually end with, but that has a 10% chance of matching just by coincidence, since there are 10 possible check digits which are distributed evenly.

Let's say I receive this as a batch of input data from a single user.
100 total UPCs.
I calculate that 30 have valid check digits.
Statistically, what is the most likely number of the 100 UPCs actually having check digits? I know that some of them must, since if all 100 UPCs were valid, I should only see around 10 with matching check digits. I don't want to confuse people by showing them which UPCs' check digits are matching (since it's hard to explain the concept of 10% might match by coincidence without the multiple paragraph explanation that I just gave) - I just want to say, "Please check your input. It looks like x out of 100 UPCs have check digits."
posted by zixyer to Technology (12 answers total)
 
Let me try and rephrase your problem to ensure I understand it correctly:

- A user is reading and entering a list of numbers which are UPCs
- these numbers may (or may not) be missing their leading zeros
- you have told the users to enter these UPCs without check digits
- you want to attempt to indicate to a user when they've made a data entry error

My question to you is: given the first three items, how is a USER to know if the last digit is a check digit or not? Do the list of numbers ALWAYS have a check digit present?
posted by scolbath at 10:59 AM on May 8, 2015


My intuition says you will not be able to say much in this situation. I don't have time to work it out at the moment, but it shouldn't be too hard. First thing to do is to think about two different statistics -

1) The confidence that you have correctly determined the nature of the UPC
2) The confidence you have that the UPC is correct

As I read your question if you get 12 digits you are 100% confident it is a correct UPC and you can determine with some confidence using the parity digit if it is correct.

For 10 and 11 digits you have a couple of different options to determine the correct nature of the UPC - based upon those different options you may have a parity digit.

So I would enumerate all of the options and start computing the expected percentages for each category.

I suspect in the end the best you could do is report: "There is a x% chance that y% of the UPCs you have entered are incorrect." More specifically, you could report each category which might provide either too much information to the end user, or provide them with enough confidence to ignore the warning when it's not relevant..
posted by NoDef at 11:01 AM on May 8, 2015


Response by poster: "There is a x% chance that y% of the UPCs you have entered are incorrect."

Yeah, this is what I'm trying to arrive at. I will probably only display the message when the x% meets a certain threshold so that they don't see it unless there is a high probability that there is a problem.

I just remember probability questions like this being immensely difficult from when I took statistics in college, so I was hoping to get some different approaches to the problem here.

how is a USER to know if the last digit is a check digit or not

They are putting UPCs in that they get from a variety of different sources. They're likely in a much better position to know why some of their UPCs have check digits than I am. For example, perhaps one report that they use always includes check digits, but they are inputting information from this report along with information from other reports that don't include the check digits. I should be able to detect an abnormally large percentage of check digits in their input, but I'm not in a position to correct them.
posted by zixyer at 11:14 AM on May 8, 2015


What I'm trying to get at is that it you may be generating a message that the user can do nothing with. Let's assume that they have entered the data correctly - then what? Or are you just trying to get them to double-check their work?
posted by scolbath at 11:28 AM on May 8, 2015


Response by poster: I should add that the users are savvy enough to know about check digits and why they are a problem, but aren't likely to appreciate a dissertation on probability in an error message.
posted by zixyer at 11:29 AM on May 8, 2015


I am not a mathematician.

In a batch of numbers, there could be an error.

There has to be two or more errors for a number to be wrong and still pass your imperfect validation test.

So, based on the size of the batch, you can get a estimate of the error rate of the batch. If you get one signaled error out of 100, there is about a 1/100 chance there is one error in any of the others, and 1/100*1/100 that there are two errors in the same number, times a 1/9 that it matches. So, I reckon that you should expect such a batch to have a second undetected error 1 out of every 90000 times.

extraerrorcount = batchsize * 9 * (1 / discoverederrorcount)^2) ## simple two-error assumption

> Saying "x out of 100 UPCs have check digits"

There's a good chance no one knows what a check digit is, and it doesn't matter. Having it or not is not the problem, but whether the UPC is valid. Say "In this batch, (x) of the price codes are known to be invalid. Based on that error rate, it's likely that (n) others are also invalid."
posted by cmiller at 11:30 AM on May 8, 2015


Response by poster: Let's assume that they have entered the data correctly - then what? Or are you just trying to get them to double-check their work?

Trying not to threadsit, and I get your point that the message would have to be carefully written. What I want the users to do is validate all of their data sources and make sure that none of them are including check digits. The reports they are getting the data from will probably have some indication of what type of UPCs are included.

The message would be something like (overusing the passive voice a bit): "From an analysis of the UPCs, it has been determined that about 31 of them include check digits. Please check your data sources and validate that you are entering UPCs without check digits."
posted by zixyer at 11:46 AM on May 8, 2015


Best answer: As I understand the question, you want the user to enter the data without the check digit.
Instead of an error message saying "x out of UPC have check digits" why not just be more accurate and say "x codes have been flagged as having possible check digit problems, please review the data to determine if there is a problem"

That avoids all questions of probably and encourages the user to double check the obviously buggy data without saying that all data flagged as having check digits are necessarily wrong.
posted by metahawk at 11:47 AM on May 8, 2015 [1 favorite]


I agree with Metahawk. If you can't give the user a bit more information to go on, you are really just going to p*ss them off. If you do nothing else, you really should at least highlight what you believe to be suspect data rather than saying 'hey, some random set of data may or may not have issues'.

Here's a suggestion: if you have a finite set of reports that the data comes from, and you know that reports from TPS International always have check digits, where as reports from Vandelay Industries never do, why not have the users select which report they are entering data from, THEN have the data EXACTLY AS IT APPEARS IN THE REPORT, check digit or otherwise. This would greatly increase the fidelity of your error checking and data normalization.
posted by scolbath at 11:53 AM on May 8, 2015 [1 favorite]


It sounds like you're just trying to inform the user that "x% of records appear to have a valid check digit. 10% was expected."

Is there any reason you aren't just running a straight percentage? Get a few sample data sets and check the variance for your 10% empirically. I think that's what you're really getting at -- you're trying to get a confidence value for the data set having check digits, but you can't do that based off of a modeled 10%. Run the test sets, get your acceptable variance, and set a tolerance for how high the straight % of potential check digits can be.
posted by bfranklin at 1:03 PM on May 8, 2015


Response by poster: Yeah, I think this is more of an interesting probability problem that arose in my work than something that I can put to use in any way. There's no way I can write the error message that won't result in me either getting a bunch of calls that require me to explain confidence intervals and the number of numerical digits in the decimal numbering system, or giving the user so little information as to be useless.

Just to satisfy my curiosity, I figured out that the probability for a particular number of collisions x with the number of UPCs in the batch as n is given by
C(x, n) * (1/10)^x * (9^10)^(n-x)
After putting that in a spreadsheet and doing a little brute force mathematics, the answer to my original question ended up being pretty obvious. It seems that most likely number of UPCs with check digits given batch size n and x collisions is
x - floor(n / 10)

posted by zixyer at 1:34 PM on May 8, 2015


Response by poster: There are a couple errors in the first equation above. It should be modified to the following.

C(n, x) × 110x × 910nx
posted by zixyer at 4:06 PM on May 8, 2015


« Older Form Letter Software needed   |   Are there any tv fan sites split up by the season... Newer »
This thread is closed to new comments.