Math/Stats: help me analyze a data set and determine the values that created it
December 23, 2009 12:58 PM   Subscribe

Mathematics / Statistics Filter: I have some pairs of numbers that are the result of a process. Given just that data set, and a rule that relates the numbers, can you determine the integer values that could have produced them?

Apologies for the phrasing of the FPP -- I know it doesn't make much sense. Hopefully some mathematics / statistics types will click through and see this longer version.

I have some sets of numbers, shown below. I'm trying to reverse engineer the numbers that could have resulted in these sets, based on some known mathematical relationships between them.

In general, a given Device(n) consumes a resource in integer Quantities at a floating point Rate(n), resulting in a total Cost for that consumption run. What I have is Device/Cost pairs for several different Devices, with several runs each, and I'm trying to determine the floating point Rate. The Rate is constant for a given Device. The Quantity consumed is different for each run, but the key here is that I know the Quantity values are integers.

So for a given Device run, we have:

Quantity(int) x Rate(float) = Cost(float)

All I have is Cost data for each Device, but I have multiple sets of these and am hoping there's some sort of numeric analysis that can tell me the likely Quantity values that fit.

Here's a sample of the data:

Device / Cost
Device1 / 1235
Device1 / 988
Device1 / 1003
Device1 / 1526
Device2 / 3652
Device2 / 1207
Device2 / 1729
Device2 / 518
Device3 / 745
Device3 / 2115
Device3 / 1415
Device3 / 334

So, for example, using the Device1 / 988 and Device1 / 1003 pair, I could eyeball it and see that the Cost difference of 15 is due to a 1-unit difference in Quantity between the runs. Thus the first run consumed 66 x 14.97 = 988 and the second run consumed 67 x 14.97 = 1003. (Alas, the Rate values should be more in the 30-50 range, so 14.97 doesn't make much sense.) But I'm hoping that with a larger population of data, there's some analysis I can do that will give a more confident answer.

Perhaps this can even be solved without requiring that the Quantity values be integers, but it's a constraint the data is supposed to have, so I thought I'd mention it.

I'll monitor this thread for the next couple of hours to answer any questions. And will add better tags! I used the science category because this seems like the kind of math a lab scientist might be familiar with: analyzing a data set to work out the conditions that created it. I'm especially hoping for a statistical analysis that produces some sort of confidence measure, because a couple of these data points might be outliers, screwing up what might otherwise be a closed-form solution.

Note: this is not homework filter, or even do-my-job filter. It's just something I'm trying to reverse engineer.
posted by intermod to Science & Nature (20 answers total)
 
If you are trying to get the rates (and we assume that your costs are without error), it sounds like you're really trying to find the GCD of the costs for each device.

With error in the costs (which must be the case, because your example of 988 and 1003 differing by 15 has the problem that neither 988 nor 1003 is divisible by 15), things get more interesting. I'll have to think about this a little more.
posted by a snickering nuthatch at 1:12 PM on December 23, 2009
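
A minimal Python sketch of that GCD idea, assuming the posted Costs are exact integers with no measurement error (the data is copied from the question):

from math import gcd
from functools import reduce

# Cost observations per device, copied from the question
costs = {
    "Device1": [1235, 988, 1003, 1526],
    "Device2": [3652, 1207, 1729, 518],
    "Device3": [745, 2115, 1415, 334],
}

for device, values in costs.items():
    # the GCD is the largest rate that divides every Cost exactly
    print(device, reduce(gcd, values))

On this sample data the GCD collapses to 1 for every device, which is consistent with the rounding-error caveat above.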


The only information you have is the total cost, which device it was, and that the total cost is some integer times some rate?

Can't do. The number of combinations of integers and rate that could give you each device's data is literally infinite.

In your example, the difference you specify could be 1 unit X 14.97 rate, or 2 units X 7.485 rate, or 3 units X 4.99 rate, or 10 units X 1.497 rate, or 100 units X 0.1497 rate. More generally, you'd need to know the N produced, otherwise you can't tell the difference between rate R*N=total or (R/2)*2N=total.
posted by ROU_Xenophobe at 1:13 PM on December 23, 2009 [1 favorite]


Instead of assuming that costs have error, we can still use the GCD approach if we assume that costs are actually of the form quantity*rate + overhead. For example, your eyeballed difference of 15 for Device1 could be the rate itself if there was an overhead cost of 13 (plus 15*k for some integer k): 988 = 65*15 + 13 and 1003 = 66*15 + 13.

With these assumptions, there is a possible (kinda slow) algorithm that would find the rate. Make a list containing every pairwise difference between the costs for a particular device, then find the GCD of that list.
posted by a snickering nuthatch at 1:26 PM on December 23, 2009
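
A rough Python sketch of this pairwise-difference variant, assuming for simplicity an integer Rate and error-free Costs of the form Quantity*Rate + Overhead (the function name is made up):

from itertools import combinations
from math import gcd
from functools import reduce

def rate_from_differences(costs):
    # each pairwise difference is (quantity difference) * rate, so the
    # overhead term cancels; the GCD of the differences is a rate candidate
    diffs = [abs(a - b) for a, b in combinations(costs, 2)]
    return reduce(gcd, diffs)

print(rate_from_differences([1235, 988, 1003, 1526]))  # Device1

On the Device1 sample this also returns 1, again suggesting the posted Costs carry some rounding error.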


I don't think you can solve this with 100% certainty, but you can make a pretty good guess. It does help that the quantity must be an integer, because that tells you every cost must be an (integer) multiple of the rate.

Here's what I would do:
  1. Gather all the unique values of cost for one device.
  2. Sort the cost values for one device in ascending order.
  3. Calculate the difference in cost between each value and the one before it (include a 0 cost as the first one if it isn't already there in your data.)
  4. Now copy those cost increments and sort *them* in ascending order.
  5. You should see that there's a smallest cost increment, and that all other cost increments are integer multiples of it. Whatever this smallest increment is, that's the rate for that device.
This relies on having enough data for each device that you end up with at least two samples that are exactly one "quantity" apart. (A sketch of these steps follows below.)
posted by FishBike at 1:27 PM on December 23, 2009 [1 favorite]
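
A minimal Python sketch of FishBike's five steps, assuming error-free Costs and enough runs that some pair of Costs really is one Quantity apart (estimate_rate is a made-up name):

def estimate_rate(costs):
    values = sorted(set(costs))  # steps 1-2: unique values, ascending
    if values[0] != 0:           # step 3: include a zero cost if missing
        values = [0] + values
    # step 3 continued, then step 4: successive differences, sorted ascending
    increments = sorted(b - a for a, b in zip(values, values[1:]))
    return increments[0]         # step 5: the smallest increment

print(estimate_rate([1235, 988, 1003, 1526]))  # prints 15 for Device1

The full procedure would also check that every other increment is (nearly) an integer multiple of that smallest one.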


ROU_Xenophobe- there is a related interesting problem that might be good enough for him, which is to find the largest rate that works for positive integer quantities (and that rate will be unique for a given dataset).
posted by a snickering nuthatch at 1:28 PM on December 23, 2009


Rate is constant for a device. So, if you divide cost by rate and then find the GCF...

But you would still have the issue of whether the difference was for 1 additional unit, etc.
posted by borges at 1:35 PM on December 23, 2009


Approaching it from the other end: you could assume a certain value for the Quantity and then compute a best-fit Rate for that device, along with some sort of error metric (say, minimizing the overhead term suggested above). Then perform this calculation for every Quantity that makes sense (I assume you have some sort of ballpark idea of what Qty should be), and look for a minimum error. This is kind of inelegant, but it seems like it should be pretty quick to do by computer.
posted by hattifattener at 1:36 PM on December 23, 2009
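
A rough Python sketch of that brute-force search, scanning candidate Rates across the 30-50 ballpark from the question (each candidate Rate implies a rounded integer Quantity per run, so this is equivalent to scanning Quantities); the step size and error metric are arbitrary choices:

def best_rate(costs, lo=30.0, hi=50.0, step=0.01):
    best = (float("inf"), None)
    r = lo
    while r <= hi:
        # error: how far each Cost is from an exact integer multiple of r
        err = sum((c / r - round(c / r)) ** 2 for c in costs)
        best = min(best, (err, r))
        r += step
    return best  # (minimum error, rate that achieves it)

print(best_rate([1235, 988, 1003, 1526]))  # Device1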


Because I was bored....

I wrote a quick Mathematica script, taking one device's four values and trying every combination of them perturbed by anywhere from -5 to +5. Here is the largest GCD found for each device:

device 1: 15
device 2: 18
device 3: 24

code:
(you need to change the values in data to the device you want)

(* Perturb each of the four Costs by -5..+5, take the GCD of each of the
   11^4 = 14641 perturbed 4-tuples, and report the largest GCD found. *)
data = Flatten[
   Table[{745 + x, 2115 + y, 1415 + z, 334 + q},
    {x, -5, 5, 1}, {y, -5, 5, 1}, {z, -5, 5, 1}, {q, -5, 5, 1}], 3];

Sort[Table[
   GCD[data[[p, 1]], data[[p, 2]], data[[p, 3]], data[[p, 4]]],
   {p, 1, 14641}]][[-1]]
posted by scodger at 1:59 PM on December 23, 2009


(All the caveats of the above posters apply; plus you would need to check for stability in the errors, and compare the largest common factors against those from a random data set. However, the best answer would be to get more data on your other variable.)
posted by scodger at 2:25 PM on December 23, 2009


I'm not going to help you, but you might want to check out the famous Millikan Oil Drop Experiment. He determined the charge of a single electron back in 1909 by a technique not unlike what you are doing here.
posted by chairface at 2:28 PM on December 23, 2009 [1 favorite]


Response by poster: Thanks for the responses so far! I was counting on people being bored late in the afternoon on December 23rd :)

scodger, when you say +/- 5, do you mean the Cost values? Those are known and at most rounded off by 1. I'd like to assume that means +/- 0.5, but it could be +/- 1.

I'll examine this more closely tonight, in particular think more about the valid range of values for Rate. I said 30-50 in the FPP, but it might be a bit higher, like 40-70. But those results you (scodger) provided do make relative sense, based on what I know about the Devices themselves, so that's a good sign!
posted by intermod at 2:36 PM on December 23, 2009


Response by poster: chairface: exactly!
posted by intermod at 2:37 PM on December 23, 2009


Sure, I have changed the cost values, assuming that while you might have an accurate estimate, there will be some process error in there. It also corrects a bit for the fact that I have assumed both quantity and rate are integers (I don't see an easy way of doing anything if one or both are reals unless you have a lot more data).
posted by scodger at 2:54 PM on December 23, 2009


You say that Cost should be a float, but all of the Costs in your example are ints. Is this a coincidence?
posted by mhum at 3:23 PM on December 23, 2009


I could eyeball it and see that the Cost difference of 15 is due to 1 unit of Quantity difference in the runs

Why wouldn't you assume that a cost difference of 15 was due to 15 units of quantity? In fact, in your example, what prevents you from assuming that all the rates were equal to 1? I'm guessing that there's some extra information here. Perhaps it would help if you showed an example with non-integral Costs.
posted by mhum at 3:27 PM on December 23, 2009


How much more data can you get?

With more observations, you can do a histogram of cost occurrences for every device. Visual inspection will give you an idea of how quantity is distributed, of the rate, and of a possible process error. Does quantity follow the same law for all devices?
posted by Tobu at 4:17 PM on December 23, 2009
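
A minimal Python sketch of such a histogram, assuming many more observations than the four per device in the question (with only four values it won't reveal much); the bin width is an arbitrary choice:

from collections import Counter

def cost_histogram(costs, bin_width=10):
    # count how many Costs fall into each bin of width bin_width
    bins = Counter((c // bin_width) * bin_width for c in costs)
    for start in sorted(bins):
        print(f"{start:5d}-{start + bin_width - 1:5d} {'#' * bins[start]}")

cost_histogram([1235, 988, 1003, 1526])  # Device1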


Also, you can make linear combinations of your observations with small integer coefficients. 745+1415-2115=45, but you've tripled the rounding error, from ±1 to ±3. Generate all such combinations while bounding the accumulated error (the coefficient vector's norm), and you'll have a more complete picture; the smallest result that isn't within the error margin of zero should be the rate.
posted by Tobu at 5:23 PM on December 23, 2009
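
A rough Python sketch of that combination search, assuming each Cost carries at most ±1 of rounding error, so a combination whose coefficients have absolute values summing to k carries at most ±k of error (the function name and coefficient bound are made-up choices):

from itertools import product

def small_combinations(costs, max_coeff=2):
    results = []
    for coeffs in product(range(-max_coeff, max_coeff + 1), repeat=len(costs)):
        value = sum(k * c for k, c in zip(coeffs, costs))
        error = sum(abs(k) for k in coeffs)  # worst-case accumulated error
        if value > error:  # keep positive results clear of the error margin
            results.append((value, coeffs))
    return sorted(results)  # the smallest surviving values are rate candidates

print(small_combinations([745, 2115, 1415, 334])[:5])  # Device3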


Tobu: I think that if the errors are independent, then adding them together should lessen the error in the total, tending to zero with a very large sample (assuming the error is symmetrically distributed).

Intermod, I was wondering: do you have any more constraints on either the quantities or the rates? That would help matters.
posted by 92_elements at 4:05 AM on December 24, 2009


Thinking about it again (why are other people's problems more interesting than mine...):

We know that Quantity is an integer, so we can do a little bit. Take the Cost and divide it by our upper and lower limits of rate (30 and 50, but use 60 and 20 to give room for error). Round the first result down and the second up to give lower and upper bounds on the Quantity for that run.

e.g. 1235/60 -> 20, 1235/20 -> 62

Now divide your cost (1235) by all possible integer values of Quantity (20-62). We now have a list of possible values for the rate. You can now do one of two things:

1. Repeat for the rest of that device's Costs. Round/bin the numbers, then draw a histogram or take the set intersection, and look for the rates that several Costs share; these are putative rates. Take the one that requires the least rounding to get integers for all Quantities.

2. Take all your data for that device, and divide each Cost by each possible rate; choose the rate that needs the least rounding. You might want to repeat this with the candidate rates from several Costs, in case your initial choice is an outlier.

For a statistical test, I guess that the amount of rounding needed for each rate should be normally distributed, so you might be able to do a simple z-test on your final minimally-rounded value. Alternatively, you could put the data in a scatter plot and fit a regression line, calculating the R^2 of the rounded values (non-rounded will be 1) and picking the best fit.
posted by scodger at 11:07 AM on December 24, 2009
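
A rough Python sketch of option 2, using the padded 20-60 Rate bounds suggested above (the helper names and the absolute-rounding score are made-up choices):

import math

def candidate_rates(cost, rate_lo=20, rate_hi=60):
    q_lo = math.floor(cost / rate_hi)  # lower Quantity bound, rounded down
    q_hi = math.ceil(cost / rate_lo)   # upper Quantity bound, rounded up
    return [cost / q for q in range(max(q_lo, 1), q_hi + 1)]

def pick_rate(costs):
    best = (float("inf"), None)
    for rate in candidate_rates(costs[0]):
        # total rounding needed to make every implied Quantity an integer
        err = sum(abs(c / rate - round(c / rate)) for c in costs)
        best = min(best, (err, rate))
    return best  # (total rounding, best candidate rate)

print(pick_rate([1235, 988, 1003, 1526]))  # Device1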


Check this out. It is awesome.

http://ccsl.mae.cornell.edu/eureqa

What it is in a nutshell: it is a spreadsheet-like tool that you can enter raw data into; using AI, it figures out what equation relates the numbers.

There is a two-part, easy-to-understand video on the site as well that explains it better than I just did.
posted by santogold at 10:39 PM on December 24, 2009 [1 favorite]


This thread is closed to new comments.