Data Munge: Uknown combination of 7 numbers produces a given number.
January 17, 2017 10:20 AM   Subscribe

Data Munging Pros: In a spreadsheet with ten data columns, Column "A" is related to columns "B" through "J" in that some unknown combination or interaction of some or all of them will produce the data in Column "A". How do we determine what that combination is? (I can't even find what this type of problem is called, and I've not been able to find any productive Google search terms.) A more thorough explanation/example is provided in-thread.

Quick example, very slightly different than the one above the fold:

Say I have two spreadsheets that I know are related in some way.

Spreadsheet One has two Columns: "Alpha" and "Bravo". An example row for Alpha and Bravo are:
Alpha = 42,492
Bravo = 3,115
Spreadsheet Two has seven fields: "Charlie" through "India". Example rows for Charlie through India are:
Charlie = 39,912
Delta = 4
Echo = 17,664
Foxtrot = 11,313
Golf = 7,719,011
Hotel = 15,084
India = 412
Now in this example, example Alpha (42,492) is the relationship:
[Charlie + Echo - Hotel]
And example Bravo (3,115) is the relationship:
[(Hotel / Delta) + Foxtrot]
Easy enough to do with Two Columns in Spreadsheet One and only seven columns in Spreadsheet Two. But for my problem, I have 5 entries in "One" and I have something like 35 entries in "Two".

What is this type of problem called? What is the way to go about attacking it? Assume that the only operators involved in the relationship will be "add/subtract and multiply/divide".

Thanks so much for your time and your expertise.
posted by jjjjjjjijjjjjjj to Science & Nature (13 answers total) 2 users marked this as a favorite
Sounds like you need a matrix.
posted by notsnot at 10:34 AM on January 17, 2017

If you only allowed constant multiplication and addition, e.g. A= 2B - 0.35C... +5J, then you'd be looking at a linear combination, and it would be relatively straightforward to figure it out.

Linear algebra is where things work nicely, and many students learn to solve these equations of the form Ax=b .

If you can't be certain of the operations permitted, there is no way to solve the problem in a unique manner. You may find a solution that agrees with the given data, but there may be many, and you can't be sure that the next row of data won't invalidate your proposed solution.

If you have a polynomial system, you might be ok, but the division will really prevent bringing any standard college level theory to bear.

Given a few rows, some of us may be able to find the pattern, but there's not any systematic way to deal with solving truly general systems of equations.
posted by SaltySalticid at 10:42 AM on January 17, 2017 [3 favorites]

The problem is the equation you're looking for is not necessarily unique. If you had a linear combination (only addition and multiplication by constants) you could say more about the conditions for a unique solution, but division complicates things.

One thing you could try would be to plot the two sets of data and try to visually guess the relationship between them.

You can also hypothesize an equation relating the two sets of data, then plug your numbers in to get a system of equations and see if what you get is inconsistent or not.

Terms you could search for are "regression" and "parameter determination."
posted by dilaudid at 10:45 AM on January 17, 2017 [4 favorites]

Math people scoff when I suggest stuff like this, but if you know how to script you could just try all the permutations of adding/subtracting/multiplying/dividing.

I mean you can't try all the permutations, but you can try billions quickly. I've written scripts to do stuff like this. YMMV
posted by gregr at 10:48 AM on January 17, 2017 [1 favorite]

if column A is a *linear* combination of columns B-J, then you may have a solution. If it's non-linear (eg polynomial of higher degree, or trig or whatever), you most likely have infinitely many solutions
posted by youchirren at 10:49 AM on January 17, 2017

You could identify likely "ingredient" terms in Spreadsheet Two by measuring correlations between each combination of columns in Spreadsheet One and Spreadsheet Two. This is similar to dilaudid's suggestion of plotting & guessing. But since you have ton of columns, and you could automate (or set up formulas) to measure correlations, this might be a way to narrow down to the relevant terms.
posted by paper chromatographologist at 10:51 AM on January 17, 2017 [1 favorite]

Unknown combination of 7 numbers produces a given number
A is the desired result, B thru J gives you 9 input numbers not 7

If you were looking for say 3 input numbers that sum to the value of A there is a well known solved problem 3SUM that covers it, and the general algorithm people use could be extended to look for 1 match, 2SUM, 3SUM ... 9SUM. But that's just looking for values that sum to the result you want, so is only using the addition operator. What you are asking for is much more complicated. You would need to try all combinations of size 1..9 with all allowed combinations of operators between them, and with the numbers in all permutations.
posted by w0mbat at 10:59 AM on January 17, 2017 [2 favorites]

This is an extremely unconstrained inference problem, far from just "data munging." I don't know of a name or established approach for this exact problem. Is it possible the problem actually has more structure that you're not taking account of? If we knew about that, this might be an easier problem. For example, as others have said, if entries are linear combinations of others, the method of linear regression has attractive properties. ("Regression" is sometimes used to mean, generally, estimating numerical relationships in data, but I haven't seen "regression" methods that solve the problem as you've described it.)

If we are stuck with the problem as described, then exhaustive search is a natural thing to think about.

Accounting for order of operations, how many unique ways are there to combine 7 numerical operands with four binary operators? I can't quite do it in my head. Maybe 2^17ish? So exhaustive search may be practical if you only need to find the relationship for one row. (If all the rows exhibit the exact same relationship, then you should only have to search combinations for a very few rows I think.)
posted by grobstein at 12:06 PM on January 17, 2017 [3 favorites]

You might have some luck with high-powered non-parametric regressions, but honestly I would in no way recommend even trying to learn how to do that, unless you are already highly confident in your math and coding skills.

For example, the MARS technique is pretty cutting edge, and can give you *an* answer. But it won't be unique, and it won't be anything like a nice formula that you can write down in a few lines.

And yes, I'll echo grobstein for clarity: this is FAR beyond standard data wrangling/munging.
posted by SaltySalticid at 12:26 PM on January 17, 2017

This problem seems to be related to a classic problem in computer science, the subset sum problem. In that problem, given a set of integers, the goal is to determine if any subset sums to a given number. In your case, the problem is more difficult, because with allowing multiplication and division as well as addition, the order of operations matters. The bad news is that the subset sum problem (which seems like an easier problem to me), is already known to be NP-complete. That is, it can not be solved "efficiently" -- in polynomial time based on the size of the input.

For that reason, I don't think there is a simple trick to solving your problem for the general case. An exhaustive search could find a correct answer, but this is only feasible for small sets of numbers. For the amount of numbers you referenced (35), I think a program could find a solution in a reasonable amount of time, but writing this type of program would not be easy for a novice.
posted by demiurge at 1:38 PM on January 17, 2017 [4 favorites]

I agree that the problem as specified does not map nicely to any mainstream approach to data analysis, which makes me question the validity of how you've stated the problem itself. As suggested above, you should look for more structure in the definition of the data before deciding on the type of problem you need to solve.

Questions like:

- where did this data come from? How was it generated?
- are you trying to predict new values for unknown data in the future?
- is there an exact functional relationship between these variables? How do you know that?
- why do you think the unknown function must allow for all the operations you suggest?
- do you care about the unknown function itself, or merely the predictions that function will make?

How you answer these questions may help to eliminate certain constraining assumptions you've made about the type of problem you need to solve, allowing for a larger set of possible solutions to be applicable. The "right" way to do this is highly dependent on what your goals are.
posted by grog at 4:55 PM on January 17, 2017 [7 favorites]

If the solution space is too large to check exhaustively, you might have more luck with some sort of evolutionary algorithm.
posted by signal at 7:51 PM on January 17, 2017

Your problem is similar to arithmetic puzzles like this one.

If each expression like (Alpha + November) or (Charlie/Hotel - Foxtrot) is a piece of hay, your problem is to search through an entire haystack of these expressions for the pieces that come out to the numbers you're looking for (42,492 and 3,115 in this case). In your 35 entry case, I'd be willing to bet there's more pieces of hay in that stack than atoms in the universe. Is there a way you can accomplish whatever you're trying to do without solving this problem?
posted by panic at 9:16 PM on January 17, 2017

« Older Working with the elderly. A life for me?   |   Fool me once shame on you, fool me twice shame on... Newer »
This thread is closed to new comments.