June 14, 2009 5:16 AM Subscribe

Optimisation problem: trying to find the subset of my data that meets a couple of constraints.

I have a data set with 200 columns and 20 rows of numbers. I am trying to find the 20 columns that make the mean of all values in the resulting 20 x 20 data set as large as possible--subject to the constraint that no row in the 20 x 20 set can have a mean value below a particular figure.

I have access to a range of software tools (Excel, Mathematica, SPSS). I know how to specify the constraint if I were going for a brute force solution, but obviously that's not going to be feasible here. All suggestions greatly appreciated. . . . .
posted by muhonnin to Science & Nature (18 answers total)

I have a data set with 200 columns and 20 rows of numbers. I am trying to find the 20 columns that make the mean of all values in the resulting 20 x 20 data set as large as possible--subject to the constraint that no row in the 20 x 20 set can have a mean value below a particular figure.

I have access to a range of software tools (Excel, Mathematica, SPSS). I know how to specify the constraint if I were going for a brute force solution, but obviously that's not going to be feasible here. All suggestions greatly appreciated. . . . .

With a relatively small matrix, a brute force solution would probably work fine. This would be annoying to write in Excel, but seems pretty straightforward in any programming language that can handle arrays or matrices well. I don't have the budget to use Mathematica, but I wrote up a quick solution in R below (using random data). Haven't done any debugging, but I don't think there are huge problems with it.

# set up some parameters

number.of.columns <> number.of.rows <> small.matrix.size <> minimum.row.cutoff <>

# make some random data

m <>

# initialize the results vectors

min.row.mean <> matrix.mean <>

# calculate the averages

for (i in 1:length(matrix.mean)) {

small.m <> min.row.mean[i] <> matrix.mean[i] <> }

# find the solution

max(matrix.mean[min.row.mean > minimum.row.cutoff])

posted by eisenkr at 6:53 AM on June 14, 2009

# set up some parameters

number.of.columns <> number.of.rows <> small.matrix.size <> minimum.row.cutoff <>

# make some random data

m <>

# initialize the results vectors

min.row.mean <> matrix.mean <>

# calculate the averages

for (i in 1:length(matrix.mean)) {

small.m <> min.row.mean[i] <> matrix.mean[i] <> }

# find the solution

max(matrix.mean[min.row.mean > minimum.row.cutoff])

posted by eisenkr at 6:53 AM on June 14, 2009

hmm... that code got butchered via the posting mechanism. I'll mefi mail you the code, hopefully it won't get destroyed that way.

posted by eisenkr at 6:54 AM on June 14, 2009

posted by eisenkr at 6:54 AM on June 14, 2009

Do you mean the mean across rows, columns, or both?

posted by a robot made out of meat at 8:44 AM on June 14, 2009

Oh, I guess that it doesn't matter.

posted by a robot made out of meat at 8:46 AM on June 14, 2009

posted by a robot made out of meat at 8:46 AM on June 14, 2009

How pathologic is the dataset? Are the column means close? Variances? Are the rows values roughly evenly distributed? If you just use the 20 biggest columns how many of your rows violate the minimum condition? I was starting to think of guided searches for good columns, but the strategy depends on what the data look like.

posted by a robot made out of meat at 8:58 AM on June 14, 2009

posted by a robot made out of meat at 8:58 AM on June 14, 2009

eisenkr: the problem is that there are choose(200,20) = 1.6e+27 possible choices of columns.

posted by a robot made out of meat at 9:00 AM on June 14, 2009

posted by a robot made out of meat at 9:00 AM on June 14, 2009

Well, the mean (really the sum, save yourself a division) of your matrix is linear in the mean (sum) of the columns. If it wasn't for the constraint on your rows, you'd get the answer by just picking the 20 columns with the highest sums.

I agree with a robot made out of meat that it depends on the nature of the data, in particular, how common problematic rows are.

posted by atrazine at 9:42 AM on June 14, 2009

I agree with a robot made out of meat that it depends on the nature of the data, in particular, how common problematic rows are.

posted by atrazine at 9:42 AM on June 14, 2009

See, I was thinking that you could brute force it by starting with the greatest possible sum, then working your way down and checking for your constraint at each combination of columns until you find a match, but whether that would work depends on how far you'd have to go before hitting a matching combination.

Even that's not so easy because it's much easier to find the maximum than it is to find the second highest.

posted by atrazine at 9:47 AM on June 14, 2009

Even that's not so easy because it's much easier to find the maximum than it is to find the second highest.

posted by atrazine at 9:47 AM on June 14, 2009

I've been making progress and have been able to discard some data so that I now have 45 columns and 15 rows, but this still isn't enough to allow a brute force solution. The data are fairly problematic; looking at the top 1000 or so combinations of 20 columns, I get no fewer than 3 rows (and not always the same 3) which violate the constraint. I'm focusing now on seeing whether there is *any* solution which meets my criteria, rathr than finding the best one.

If I wanted to spend a couple of days on it I could try to remember how to code up a genetic algorithm to search for a solution, but in practical terms I can probably only afford to spend a few more hours. If anyone can point me to some resources on other methods to solve this*type* of problem, I'd be obliged.

posted by muhonnin at 9:59 AM on June 14, 2009

If I wanted to spend a couple of days on it I could try to remember how to code up a genetic algorithm to search for a solution, but in practical terms I can probably only afford to spend a few more hours. If anyone can point me to some resources on other methods to solve this

posted by muhonnin at 9:59 AM on June 14, 2009

Some pseudo code:

It should perform slightly better than naive brute force by focusing attention on the worst columns

posted by chrisamiller at 10:28 AM on June 14, 2009

Choose 20 random columns to start

100000.times do

calculate the sums of each row

if solution is found

store/ouptut the matrix

else

randomly choose one of the rows that is below threshold

find the column with the smallest value in that row

swap it out with a random column from the pool

endif

end #loop

It should perform slightly better than naive brute force by focusing attention on the worst columns

posted by chrisamiller at 10:28 AM on June 14, 2009

Do you have access to MATLAB?

posted by Commander Rachek at 11:50 AM on June 14, 2009

posted by Commander Rachek at 11:50 AM on June 14, 2009

This post is useless without pics! (oops i meant useless without the actual columns of numbers)

if you're down to 45 columns and 15 rows, I feel like you should have been able to crunch through more scenarios than just the top 1000.

I'm assuming you are trying combinations of columns starting with the highest means. If all you really need to do is calculate the mean of 15 rows each time you try a new combination, it seems like a brute force search would have netted more than 1000 trials by now.

posted by mrgoldenbrown at 12:24 PM on June 14, 2009

if you're down to 45 columns and 15 rows, I feel like you should have been able to crunch through more scenarios than just the top 1000.

I'm assuming you are trying combinations of columns starting with the highest means. If all you really need to do is calculate the mean of 15 rows each time you try a new combination, it seems like a brute force search would have netted more than 1000 trials by now.

posted by mrgoldenbrown at 12:24 PM on June 14, 2009

OK, what i should also have mentioned is that if we knew more details about the particular set of data, we might have ideas for optimizing the search. For example, let's say the first row of numbers looked like a pile of 1's and a solitary 42, and the average had to be 17. then in that case you would know that if a solution exists, the column with 42 has to be involved. your data might not be quite as contrived as that, but the principle of eliminating impossible rows or autoincluding required rows might limit you search.

posted by mrgoldenbrown at 12:33 PM on June 14, 2009

posted by mrgoldenbrown at 12:33 PM on June 14, 2009

mrgoldenbrown, it's still probably too big a problem to brute-force: 45 choose 15 is still over 300 billion permutations.

The difficulty is that your problem is a (0,1)-integer optimization problem, and these are generally NP-hard. You can see this by writing your data matrix as an*m*x*n* matrix A, and your minimum row mean as *a*. (We'll also let x_{1} be an *m*x1 column vector of ones.) The problem is then to find an *n*-element column vector x_{2} that will minimize ^{1}/_{(mn)} x_{1}^{T}Ax_{2}, subject to:

posted by Upton O'Good at 10:51 PM on June 14, 2009

The difficulty is that your problem is a (0,1)-integer optimization problem, and these are generally NP-hard. You can see this by writing your data matrix as an

^{1}/_{n}Ax_{2}> a x_{1}- x
_{2j}∈ {0,1} for j in 1 to n. - Σx
_{2j}= m

posted by Upton O'Good at 10:51 PM on June 14, 2009

Hmm, that "minimize" should be a "maximize" in the problem definition above. Sorry about that.

posted by Upton O'Good at 7:08 AM on June 15, 2009

posted by Upton O'Good at 7:08 AM on June 15, 2009

...and the n's in the denominator of the cost function and the first constraint should be m's. (Maybe I shouldn't write AskMe answers at 2AM.)

posted by Upton O'Good at 7:20 AM on June 15, 2009

posted by Upton O'Good at 7:20 AM on June 15, 2009

Seconding the suggestion for formulating the problem as an integer linear program. Let your 45x20 matrix be denoted A[i,j], i=1..20, j=1..45 and let B be the bound on your row sums.

You want something like:

Maximize:

sum {j=1..45} A[i,j] * x[j]

Such that:

x[j] is integer, for j=1..45

0 <= x[j] <= 1, for j=1..45

sum {j=1..45} A[i,j] * x[j] >= B, for i=1..20

While these kinds of problems are hard in general, in practice, it's the exact structure of your problem that matters. The solver might take ages and ages or it might take a few milliseconds to return (either with an optimal solution or a determination that the problem is infeasible). All you can do is try it and see. It looks like Mathematica supports integer linear programming, but I've never tried it.

posted by mhum at 1:58 PM on June 15, 2009

You want something like:

Maximize:

sum {j=1..45} A[i,j] * x[j]

Such that:

x[j] is integer, for j=1..45

0 <= x[j] <= 1, for j=1..45

sum {j=1..45} A[i,j] * x[j] >= B, for i=1..20

While these kinds of problems are hard in general, in practice, it's the exact structure of your problem that matters. The solver might take ages and ages or it might take a few milliseconds to return (either with an optimal solution or a determination that the problem is infeasible). All you can do is try it and see. It looks like Mathematica supports integer linear programming, but I've never tried it.

posted by mhum at 1:58 PM on June 15, 2009

This thread is closed to new comments.

posted by losvedir at 6:32 AM on June 14, 2009