How to maximise a matrix subject to constraints?
June 14, 2009 5:16 AM

Optimisation problem: trying to find the subset of my data that meets a couple of constraints.

I have a data set with 200 columns and 20 rows of numbers. I am trying to find the 20 columns that make the mean of all values in the resulting 20 x 20 data set as large as possible--subject to the constraint that no row in the 20 x 20 set can have a mean value below a particular figure.

I have access to a range of software tools (Excel, Mathematica, SPSS). I know how to specify the constraint if I were going for a brute force solution, but obviously that's not going to be feasible here. All suggestions greatly appreciated.
posted by muhonnin to Science & Nature (18 answers total)
Hm, fun problem. I'll think about it for a bit, and think about how to include that constraint, but here's a quick thing that can make the calculation easier: I think to find the 20x20 matrix with the largest mean, you can simply calculate the mean of each column, and then take the top 20 of those. But now how to include that constraint, hm.....
posted by losvedir at 6:32 AM on June 14, 2009
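losvedir's unconstrained step (rank the columns by mean and keep the best 20) is a one-liner in most tools; a sketch in plain Python, with a toy matrix and function name of my own invention:

```python
def top_k_columns(m, k):
    """Return the indices of the k columns with the largest column means."""
    n_cols = len(m[0])
    col_means = [sum(row[j] for row in m) / len(m) for j in range(n_cols)]
    # sort column indices by mean, descending, and keep the first k
    return sorted(range(n_cols), key=lambda j: col_means[j], reverse=True)[:k]

# tiny example: 2 rows x 4 columns; columns 2 and 3 have the largest means
data = [[1, 2, 9, 8],
        [1, 2, 9, 8]]
print(top_k_columns(data, 2))  # -> [2, 3]
```

Without the row constraint this greedy choice is provably optimal, since the submatrix mean is just the average of the chosen column means.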

With a relatively small matrix, a brute force solution would probably work fine. This would be annoying to write in Excel, but seems pretty straightforward in any programming language that can handle arrays or matrices well. I don't have the budget to use Mathematica, but I wrote up a quick solution in R below (using random data). Haven't done any debugging, but I don't think there are huge problems with it.

# set up some parameters
number.of.columns <- 200
number.of.rows <- 20
small.matrix.size <- 20
minimum.row.cutoff <- 0.5  # cutoff value is arbitrary for random data

# make some random data
m <- matrix(runif(number.of.rows * number.of.columns), nrow = number.of.rows)

# enumerate the candidate column subsets
column.sets <- combn(number.of.columns, small.matrix.size)

# initialize the results vectors
min.row.mean <- numeric(ncol(column.sets))
matrix.mean <- numeric(ncol(column.sets))

# calculate the averages
for (i in 1:length(matrix.mean)) {
  small.m <- m[, column.sets[, i]]
  min.row.mean[i] <- min(rowMeans(small.m))
  matrix.mean[i] <- mean(small.m)
}

# find the solution
max(matrix.mean[min.row.mean > minimum.row.cutoff])
posted by eisenkr at 6:53 AM on June 14, 2009

hmm... that code got butchered via the posting mechanism. I'll mefi mail you the code, hopefully it won't get destroyed that way.
posted by eisenkr at 6:54 AM on June 14, 2009

the mean of all values

Do you mean the mean across rows, columns, or both?
posted by a robot made out of meat at 8:44 AM on June 14, 2009

Oh, I guess that it doesn't matter.
posted by a robot made out of meat at 8:46 AM on June 14, 2009

How pathological is the dataset? Are the column means close? Variances? Are the row values roughly evenly distributed? If you just use the 20 biggest columns, how many of your rows violate the minimum condition? I was starting to think of guided searches for good columns, but the strategy depends on what the data look like.
posted by a robot made out of meat at 8:58 AM on June 14, 2009

eisenkr: the problem is that there are choose(200,20) = 1.6e+27 possible choices of columns.
posted by a robot made out of meat at 9:00 AM on June 14, 2009
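For the record, that count is easy to verify in Python:

```python
import math

# number of ways to pick 20 columns out of 200 -- far too many to enumerate
print(math.comb(200, 20))  # roughly 1.6e27
```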

Well, the mean (really the sum, save yourself a division) of your matrix is linear in the mean (sum) of the columns. If it wasn't for the constraint on your rows, you'd get the answer by just picking the 20 columns with the highest sums.

I agree with a robot made out of meat that it depends on the nature of the data, in particular, how common problematic rows are.
posted by atrazine at 9:42 AM on June 14, 2009

See, I was thinking that you could brute force it by starting with the combination with the greatest possible sum, then working your way down, checking your constraint at each combination of columns until you find a match. Whether that would work depends on how far down you'd have to go before hitting a combination that satisfies it.

Even that's not so easy, because while finding the combination with the maximum sum is trivial, enumerating combinations in decreasing order of sum (second highest, third highest, and so on) is much harder.
posted by atrazine at 9:47 AM on June 14, 2009

I've been making progress and have been able to discard some data so that I now have 45 columns and 15 rows, but this still isn't enough to allow a brute force solution. The data are fairly problematic; looking at the top 1000 or so combinations of 20 columns, I get no fewer than 3 rows (and not always the same 3) which violate the constraint. I'm focusing now on seeing whether there is any solution which meets my criteria, rather than finding the best one.

If I wanted to spend a couple of days on it I could try to remember how to code up a genetic algorithm to search for a solution, but in practical terms I can probably only afford to spend a few more hours. If anyone can point me to some resources on other methods to solve this type of problem, I'd be obliged.
posted by muhonnin at 9:59 AM on June 14, 2009

Some pseudocode:

Choose 20 random columns to start

100000.times do
  calculate the sums of each row
  if every row meets the threshold
    store/output the matrix and stop
  else
    randomly choose one of the rows that is below the threshold
    find the chosen column with the smallest value in that row
    swap it out with a random column from the pool
end # loop

It should perform somewhat better than naive brute force by focusing attention on the worst columns.
posted by chrisamiller at 10:28 AM on June 14, 2009
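chrisamiller's heuristic is easy to make concrete. A hedged sketch in Python on random data; the function name, parameter values, and toy instance are my own invention, not anything from the thread:

```python
import random

def swap_search(m, k, cutoff, iterations=100000, seed=0):
    """Local search: keep a set of k columns, and while some row mean is
    below the cutoff, evict the weakest chosen column of a failing row."""
    rng = random.Random(seed)
    n_cols = len(m[0])
    chosen = rng.sample(range(n_cols), k)
    for _ in range(iterations):
        row_means = [sum(row[j] for j in chosen) / k for row in m]
        bad_rows = [i for i, mean in enumerate(row_means) if mean < cutoff]
        if not bad_rows:
            return chosen  # every row meets the constraint
        i = rng.choice(bad_rows)
        # find the chosen column with the smallest value in the failing row
        worst = min(chosen, key=lambda j: m[i][j])
        pool = [j for j in range(n_cols) if j not in chosen]
        chosen[chosen.index(worst)] = rng.choice(pool)
    return None  # no feasible subset found within the iteration budget

# small random instance: 8 rows, 30 columns, pick 10 with row means >= 0.45
rng = random.Random(1)
data = [[rng.random() for _ in range(30)] for _ in range(8)]
solution = swap_search(data, 10, 0.45)
```

Like any local search this can cycle or stall on hard instances, so the iteration cap matters; it finds feasible subsets quickly when they're reasonably common.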

Do you have access to MATLAB?
posted by Commander Rachek at 11:50 AM on June 14, 2009

This post is useless without pics! (oops, I meant useless without the actual columns of numbers)

if you're down to 45 columns and 15 rows, I feel like you should have been able to crunch through more scenarios than just the top 1000.

I'm assuming you are trying combinations of columns starting with the highest means. If all you really need to do is calculate the mean of 15 rows each time you try a new combination, it seems like a brute force search would have netted more than 1000 trials by now.
posted by mrgoldenbrown at 12:24 PM on June 14, 2009

OK, what I should also have mentioned is that if we knew more details about the particular set of data, we might have ideas for optimizing the search. For example, let's say the first row of numbers looked like a pile of 1's and a solitary 42, and the average had to be 17. In that case you would know that if a solution exists, the column with the 42 has to be involved. Your data might not be quite as contrived as that, but the principle of eliminating impossible columns or auto-including required columns might limit your search.
posted by mrgoldenbrown at 12:33 PM on June 14, 2009
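One cheap version of that pruning can be sketched in Python (the function name and example are mine, contrived in the spirit of mrgoldenbrown's): for any row, if even its k largest values average below the cutoff, no choice of columns can save it and the whole problem is infeasible.

```python
def infeasible_rows(m, k, cutoff):
    """Rows whose k largest entries already average below the cutoff.
    If this list is nonempty, no choice of k columns can satisfy the constraint."""
    bad = []
    for i, row in enumerate(m):
        best_k = sorted(row, reverse=True)[:k]  # best case for this row
        if sum(best_k) / k < cutoff:
            bad.append(i)
    return bad

# a pile of 1's and a solitary 42: even the best 20 values average only ~3
row = [1] * 44 + [42]
print(infeasible_rows([row], 20, 17))  # -> [0]
```

The mirror image (a column whose removal makes some row infeasible must be included) gives the auto-inclusion side of the same idea.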

mrgoldenbrown, it's still probably too big a problem to brute-force: 45 choose 20 is over 3 trillion combinations.

The difficulty is that your problem is a (0,1)-integer optimization problem, and these are generally NP-hard. You can see this by writing your data matrix as an m×n matrix A, and your minimum row mean as a. (We'll also let x1 be an m×1 column vector of ones.) The problem is then to find an n-element column vector x2 that will minimize (1/(mn)) x1^T A x2, subject to:

(1/n) * sum {j=1..n} A[i,j] * x2[j] >= a, for i = 1..m
sum {j=1..n} x2[j] = 20
x2[j] in {0, 1}, for j = 1..n

However, it's probably a small enough problem to be able to get a pretty good, though not necessarily globally optimal, answer with existing software packages. Other than the binary-integer constraint, this is a standard linear programming problem with inequality and equality constraints. If you don't mind using specialized software, AMPL is an optimization programming language that I use regularly for linear problems. It can enforce integer constraints, though you'll want to pick a solver designed for integer problems. Otherwise, "(0,1)-integer programming" is probably a good place to start searching the literature.
posted by Upton O'Good at 10:51 PM on June 14, 2009

Hmm, that "minimize" should be a "maximize" in the problem definition above. Sorry about that.
posted by Upton O'Good at 7:08 AM on June 15, 2009

...and the n's in the denominator of the cost function and the first constraint should be m's. (Maybe I shouldn't write AskMe answers at 2AM.)
posted by Upton O'Good at 7:20 AM on June 15, 2009
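Folding both of Upton O'Good's corrections together, and writing k for the number of selected columns (my notation, not his; k = 20 here, which in the original 20-row problem happens to equal m), the formulation reads:

```latex
\max_{x_2 \in \{0,1\}^n} \;\; \frac{1}{mk}\, x_1^{\mathsf{T}} A\, x_2
\quad \text{subject to} \quad
\frac{1}{k} \sum_{j=1}^{n} A_{ij}\,(x_2)_j \;\ge\; a \quad (i = 1, \dots, m),
\qquad \sum_{j=1}^{n} (x_2)_j = k
```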

Seconding the suggestion for formulating the problem as an integer linear program. Let your 15x45 matrix be denoted A[i,j], i=1..15, j=1..45, and let B be the bound on your row sums.

You want to maximize something like:

sum {i=1..15} sum {j=1..45} A[i,j] * x[j]

Such that:
x[j] is integer, for j=1..45
0 <= x[j] <= 1, for j=1..45
sum {j=1..45} x[j] = 20
sum {j=1..45} A[i,j] * x[j] >= B, for i=1..15

While these kinds of problems are hard in general, in practice it's the exact structure of your problem that matters. The solver might take ages and ages or it might take a few milliseconds to return (either with an optimal solution or a determination that the problem is infeasible). All you can do is try it and see. It looks like Mathematica supports integer linear programming, but I've never tried it.
posted by mhum at 1:58 PM on June 15, 2009
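For what it's worth, mhum's model is only a few lines in AMPL (the language Upton O'Good suggested above). This is a sketch, not tested against a solver; it assumes the reduced 15-row, 45-column data from muhonnin's update, and the model names are invented:

```
set ROWS := 1..15;
set COLS := 1..45;
param A {ROWS, COLS};    # the data matrix
param B;                 # minimum row sum
var x {COLS} binary;     # 1 if a column is selected, 0 otherwise

maximize TotalSum:
    sum {i in ROWS, j in COLS} A[i,j] * x[j];

subject to PickTwenty:
    sum {j in COLS} x[j] = 20;

subject to RowMinimum {i in ROWS}:
    sum {j in COLS} A[i,j] * x[j] >= B;
```

You'd still need a data file for A and B, plus an integer-capable solver (e.g. a branch-and-bound MILP solver) behind AMPL.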

This thread is closed to new comments.