# Help me optimize this naughty function!

April 8, 2006 5:04 PM Subscribe

OptimizationFilter: I need to maximize a function of about ten variables. I can only approximate the value of this function at any point, by running a simulation which takes about a minute of computation time. [more inside]

I tried using gradient descent, by approximating the gradient:

dF/dxi = [F(x1, x2, ..., xi + delta, ..., x10) - F(x1, x2, ..., xi, ..., x10)]/delta

but it takes forever, and the result is not much good.

Is there a better way to do this?

A general link to an article about optimizing non-differentiable functions or even a name of good book on the topic is very much appreciated.
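
(For concreteness, the forward-difference scheme described above can be sketched in Python. `F`, `delta`, and the step size are placeholders for the real simulated objective, not tuned values:)

```python
# Sketch of the forward-difference gradient from the question:
#   dF/dx_i ~= [F(x + delta*e_i) - F(x)] / delta
# followed by a plain gradient-ascent step (we are maximising).
# F, delta, step, and iters are illustrative placeholders.

def approx_gradient(F, x, delta=1e-3):
    """Forward-difference gradient: n+1 evaluations of F for n variables."""
    base = F(x)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += delta
        grad.append((F(xp) - base) / delta)
    return grad

def gradient_ascent(F, x0, step=0.1, iters=100, delta=1e-3):
    x = list(x0)
    for _ in range(iters):
        g = approx_gradient(F, x, delta)
        x = [xi + step * gi for xi, gi in zip(x, g)]
    return x
```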

do you have any idea about the shape? is there a single global minimum? are any of the variables closely correlated? does the function approximate a standard function (quadratic, say) along any axis? is it possible to separate out any of the variables into an analytic function? or even just into a faster simulation? do you know if any variable should be particularly independent of the others?

you really need to *think* about what you're doing and exploit any knowledge you might have if you want to get an efficient solution.

anyway, to answer the question, a good basic reference is numerical recipes. see chapter 10. in particular, sounds like you may be using steepest descent (bad) rather than conjugate gradient (good).

(are you minimizing a least squares function ("fitting" data, chi-squared, etc)? if so, you probably want to look at chapter 15, in particular the levenberg-marquardt method (which generalizes what you're doing with differentials to second order and does all the coordinates "at once"))

but you should think about the questions i gave. pulling out any analytic dependence will save you an awful lot, as will re-expressing your model in terms of variables that are only weakly correlated (and throwing away variables you really don't need for a first guess at the minimum).

on preview: personally, i would not recommend genetic algorithms unless you have a really hard problem with no chance of doing it classically. they may be cool, but they're using less information.
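
(To illustrate the steepest-descent vs conjugate-gradient point: here is a bare linear conjugate-gradient solver in pure Python, on a toy quadratic rather than the asker's simulation. On an n-dimensional quadratic it converges in at most n steps, where steepest descent can zig-zag for many more. The 2x2 example matrix is purely illustrative:)

```python
# Minimal linear conjugate gradient (Numerical Recipes ch. 10 territory).
# Minimises 0.5*x'Ax - b'x for symmetric positive-definite A, i.e. solves Ax = b.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=100):
    n = len(b)
    x = list(x0) if x0 else [0.0] * n
    r = [bi - ri for bi, ri in zip(b, matvec(A, x))]   # residual = -gradient
    p = list(r)                                        # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                    # exact line search
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        # New direction is conjugate to the previous ones, not just downhill.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```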

posted by andrew cooke at 5:29 PM on April 8, 2006

sorry, asking whether it approximates a quadratic is pretty stupid - most classical approaches assume it does, as you'd expect for smooth functions. i was thinking about numerical integration (which has exact solutions for known functions)...

posted by andrew cooke at 5:32 PM on April 8, 2006

weston:

The main problem with my current method is the number of evaluations required to go one step forward. As I said, I can only simulate this function, and each evaluation takes about one minute of computation time. To approximate the gradient, I need 10 function evaluations, which means 10 minutes of simulation for just one step ahead in gradient descent.

andrew:

Thanks! Lots of good points in your post.

I don't think I know anything about the shape of the function.

(Very vague description: F is basically the expected reward of some game, if a player chooses X as her policy. I evaluate F by playing a few thousand rounds of the game with the given policy and approximate the reward)

But maybe, as you said, I should think harder about the problem.

Numerical Recipes seems like an awesome book. I'm gonna read over chapter 10 now, and see if i get a better understanding of what I am doing wrong.

posted by lenny70 at 5:57 PM on April 8, 2006

Well, I don't think you're going to be able to do this *quickly*.

I'd go for Simulated Annealing in your situation. It's kind of like a genetic algorithm, but genetic algorithms are usually used for far more complicated input spaces.
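
(A minimal simulated-annealing sketch, for reference. `F` stands in for the noisy simulated objective, and the temperature schedule, step size, and iteration counts are illustrative guesses, not tuned values:)

```python
# Simulated annealing for maximisation: always accept uphill moves,
# accept downhill moves with probability exp(dF/t) while temperature t is high,
# so the search can escape local maxima early on.
import math, random

def anneal(F, x0, t0=1.0, cooling=0.95, steps_per_t=20, t_min=1e-3, sigma=0.1):
    x, fx = list(x0), F(x0)
    best, fbest = list(x), fx
    t = t0
    while t > t_min:
        for _ in range(steps_per_t):
            y = [xi + random.gauss(0.0, sigma) for xi in x]  # random perturbation
            fy = F(y)
            if fy > fx or random.random() < math.exp((fy - fx) / t):
                x, fx = y, fy
                if fx > fbest:
                    best, fbest = list(x), fx
        t *= cooling
    return best, fbest
```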

posted by delmoi at 6:09 PM on April 8, 2006

by the way, your gradient descent algorithm is *highly* parallelizable. You could try to convince 10 friends to run the simulation on their computers.

posted by delmoi at 6:12 PM on April 8, 2006
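
(Sketch of that point: the 10 perturbed evaluations in a forward-difference gradient are independent, so they can run concurrently. A `ThreadPoolExecutor` stands in here for "10 friends' computers"; a genuinely minute-long simulation would use separate processes or machines instead:)

```python
# Evaluate all perturbed points of a forward-difference gradient concurrently.
# F and delta are placeholders for the real simulation.
from concurrent.futures import ThreadPoolExecutor

def parallel_gradient(F, x, delta=1e-3):
    base = F(x)

    def shifted(i):
        xp = list(x)
        xp[i] += delta
        return F(xp)

    # Each coordinate's perturbed evaluation is independent of the others.
    with ThreadPoolExecutor(max_workers=len(x)) as pool:
        shifted_vals = list(pool.map(shifted, range(len(x))))
    return [(fv - base) / delta for fv in shifted_vals]
```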

I've used simulated annealing (copied from Numerical Recipes) and it worked ok, but you do much better by limiting your search space using knowledge of the problem.

posted by shothotbot at 6:14 PM on April 8, 2006

I agree that a genetic algorithm would be a good thing to look at, but you'd have to spend a lot of time on figuring out the specific parameters e.g. how many offspring per generation, how massive of a mutation to induce in each offspring, and so on. With a ten-dimensional solution space, I'd tend to think that you'd need a lot of offspring per generation -- 20 or 30, if not more. And if the solution space is really bumpy, you might need to permit the mutations to be quite large, so as to permit you to get out of the region of a local hump to find another one which might be bigger.

This isn't going to be something you solve rapidly (i.e. by tomorrow at 5PM). If it takes you 1 minute per iteration of the calculation, then unless you can gain access to some sort of supercomputing array computer, you can figure this is going to take you weeks. (If for no other reason, than because you're probably going to spend a week or two just tuning the parameters of the genetic algorithm itself.)

Of course, once you find a really high peak, you'll have to tune the algorithm again by cranking way back on the size of mutations, so that you search the region of the hump you found to locate its peak.

If you go this route, be forewarned that you will never know for sure if you've actually maximized it, by which I mean found God's Solution. You can probably find a really, really good answer this way, but you won't ever know for sure if there might not be another one which is just a skosh better.
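
(A toy, mutation-only version of the search described above, for the record: a population of candidate parameter vectors, Gaussian mutation, and survival of the fittest. Population size, elite count, and mutation scale are exactly the knobs mentioned above that need tuning; the values here are guesses, and there is no crossover:)

```python
# Mutation-only genetic search (an evolution strategy, strictly speaking).
# F is a placeholder for the noisy simulated objective; maximising.
import random

def genetic_search(F, x0, pop_size=30, generations=50, sigma=0.3, elite=5):
    # Seed the population with mutated copies of the starting point.
    pop = [[xi + random.gauss(0, sigma) for xi in x0] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=F, reverse=True)          # best first
        parents = pop[:elite]                  # elites survive unchanged
        children = [
            [xi + random.gauss(0, sigma) for xi in random.choice(parents)]
            for _ in range(pop_size - elite)
        ]
        pop = parents + children
    return max(pop, key=F)
```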

posted by Steven C. Den Beste at 6:49 PM on April 8, 2006

Using the correct approach is really important, but a lot also depends on the software you are using to solve this. It sounds like you are doing this with a game, so you are probably going to have to run simulations as you currently are, particularly if the game space is not enumerable (and if it's an interesting game, it probably is not).

I write poker software, and one of the approaches I use is borrowed from the maker of pokerstove. Basically, although there are a bajillion possibilities when you have more than a few players, there are generally only a few hundred thousand actual possible different outcomes. So, with some foreknowledge of how a particular poker game works, you can divide all the enumerated game possibilities into equivalency classes, and solve each equivalency class, and scale the solution by the size of the class. As an example (you need to know a little about poker):

In Texas holdem, if I have Ad Kc and you have 8d 8c, and you want to know how often each one would win, this is no different than if I have Ad Kc and you have 9d 9c. Why is this so? Well, in each case, any hand I could make with AK that would beat 88 would also beat 99, and vice versa. So you need not consider 99 and 88 to be different cases (actually, any pair under TT will be equivalent; the ones TT and above will be *slightly* different, since they will interfere with AK's ability to make straights).

Anyway, the point is, you may be able to (programmatically) break the search space down into classes and solve each of those.
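
(The equivalence-class idea, sketched on a toy dice game rather than poker: when the payoff depends only on some summary of the outcome — here, the sum of two dice — you evaluate each class once and weight by its size, instead of evaluating every raw outcome:)

```python
# Group the 36 raw (d1, d2) dice outcomes into 11 equivalence classes keyed
# by the sum, evaluate the payoff once per class, and weight by class size.
# The dice game and payoff function are illustrative stand-ins.
from collections import Counter
from fractions import Fraction
from itertools import product

def expected_payoff(payoff):
    classes = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
    total = sum(classes.values())
    # 11 payoff evaluations instead of 36.
    return sum(Fraction(count, total) * payoff(s) for s, count in classes.items())
```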

posted by RustyBrooks at 6:52 PM on April 8, 2006

Oh, and the point I actually started to make in the previous one but then diverged from is that if there is some way to present your problem in the way that a commercial, or good open source, solver can understand, then you are on your way to a much faster solution. We use Dash's "xpress" optimizer to do financial problem solving at work and it's not only far faster than anything we'd have written, but it also approaches the problem from many different ways, depending on what it thinks might work best.

Language also matters. I redid the whole core of my poker stuff in C (from a scripting language) and it is, of course, several orders of magnitude faster.

I also use parallel processing for the really hard core calculations (one thing I did recently was compare every omaha hand's equity against one other omaha hand, heads up. There are a LOT of omaha hands and this is that number squared). I built 4 bare-bones machines for this last year for a few hundred bucks each and I also borrow time from computers at work when I can. Obviously anything one computer can do in 10 hours, 10 of them can do in about 1 hour, if it's easily parallelizable (sp?)

posted by RustyBrooks at 6:56 PM on April 8, 2006

*I agree that a genetic algorithm would be a good thing to look at, but you'd have to spend a lot of time on figuring out the specific parameters e.g. how many offspring per generation, how massive of a mutation to induce in each offspring, and so on.*

Bleh, a "genetic" algorithm is so overkill for this, and there are so many variations that trying to do it would be a waste of time compared to some other optimization strategy.

posted by delmoi at 6:57 PM on April 8, 2006

I second the non-recommendation of genetic algorithms: unless your input space is both complicated and orthogonal enough for crossing over to have a real positive effect, genetic algorithms will mostly act as stochastic gradient descent.

An important question you have to figure out is whether your search space has local maxima or whether there is only a single global maximum. If you have local maxima to worry about, then simple gradient-descent-based approaches will most likely find a local maximum, which can be much worse than the global one, and something like simulated annealing may be necessary to get you out of it.

You mentioned that the function you're trying to maximize is expected reward as a function of a policy, which makes me suspect that it certainly has multiple local maxima.

I think you have an opportunity for a major speedup by taking into account the fact that your function is an expectation, and most of the time you don't need to compute it very accurately. Do you really need to compute the gradient with low variance at each step? What if, in the beginning, you compute the average reward over only a few rounds, and as you begin to converge to a maximum, compute the value of your function more precisely?

As an elaboration on the above idea, you may want to take a look at how Markov Chain Monte Carlo (MCMC) works. Two common approaches are Gibbs Sampling and Metropolis-Hastings, both of which have plenty of references on the web. In particular, take a look at how it's done with graphical models. All of it is difficult to explain quickly, but the upshot is that you can get away with running only one (or a few) simulations of your game for each parameter adjustment: instead of waiting for your expected reward to converge at each parameter value, you wait for the pair (parameter values, expected reward) to jointly converge to some distribution. This will probably be a much more efficient use of your CPU cycles.
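
(A bare-bones Metropolis-Hastings sketch of that idea, on a toy 1-D objective rather than the asker's game: instead of computing F exactly at each point, you accept or reject noisy proposals so the chain concentrates where exp(F/T) is large. The proposal width and temperature here are illustrative, not tuned:)

```python
# Metropolis-Hastings: draw samples roughly proportional to exp(F(x)/temp).
# Each step needs only one (possibly noisy) evaluation of F.
import math, random

def metropolis(F, x0, temp=0.1, sigma=0.2, n_steps=5000):
    x, fx = list(x0), F(x0)
    samples = []
    for _ in range(n_steps):
        y = [xi + random.gauss(0, sigma) for xi in x]   # symmetric proposal
        fy = F(y)
        # Accept with probability min(1, exp((F(y) - F(x)) / temp)).
        if fy > fx or random.random() < math.exp((fy - fx) / temp):
            x, fx = y, fy
        samples.append(list(x))
    return samples
```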

posted by bsdfish at 6:58 PM on April 8, 2006

Assuming you know absolutely nothing more about your problem (you can't show weak correlation between variables, draw any type of dependency graph, introduce constraints on the optimization, you don't know whether local minima exist), I would use a direction-set method.

You said that you can't explicitly calculate derivatives, and you don't want to be spending function evaluations to approximate them, so you want an optimization strategy designed for that. See section 10.5 in Numerical Recipes, specifically Powell's method.

Also, don't throw away Nelder-Mead as a potential choice. It's 'dumber' than direction-set methods and probably won't converge as quickly but you could slap the code together in an hour.
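
(Nelder-Mead really is slap-it-together territory; here is a compact, derivative-free version in minimisation form — pass in `-F` to maximise. The coefficients are the textbook defaults, and this variant uses only an inside contraction for brevity:)

```python
# Compact Nelder-Mead simplex search: no derivatives, just function values.
# Reflect the worst vertex through the centroid of the rest; expand, contract,
# or shrink depending on how the reflected point compares.

def nelder_mead(f, x0, step=0.5, tol=1e-8, max_iter=500):
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):                      # x0 plus one point per axis
        p = list(x0)
        p[i] += step
        simplex.append(p)
    fvals = [f(p) for p in simplex]

    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]

    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: fvals[i])
        simplex = [simplex[i] for i in order]
        fvals = [fvals[i] for i in order]
        if fvals[-1] - fvals[0] < tol:      # simplex has collapsed
            break
        cen = centroid(simplex[:-1])        # centroid of all but the worst
        worst = simplex[-1]
        refl = [c + (c - w) for c, w in zip(cen, worst)]
        fr = f(refl)
        if fvals[0] <= fr < fvals[-2]:      # middling: accept reflection
            simplex[-1], fvals[-1] = refl, fr
        elif fr < fvals[0]:                 # new best: try expanding further
            ext = [c + 2 * (c - w) for c, w in zip(cen, worst)]
            fe = f(ext)
            if fe < fr:
                simplex[-1], fvals[-1] = ext, fe
            else:
                simplex[-1], fvals[-1] = refl, fr
        else:                               # still bad: contract toward centroid
            con = [c + 0.5 * (w - c) for c, w in zip(cen, worst)]
            fc = f(con)
            if fc < fvals[-1]:
                simplex[-1], fvals[-1] = con, fc
            else:                           # shrink everything toward the best
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (q - b) for b, q in zip(best, pt)]
                    for pt in simplex[1:]
                ]
                fvals = [fvals[0]] + [f(p) for p in simplex[1:]]
    return simplex[0]
```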

Oh, and if you're going to use the genetic-algorithms approach, you would probably want to play many fewer rounds of the game per evaluation, since a genetic-algorithm search will literally canvass your solution space.

posted by onalark at 8:25 PM on April 8, 2006

You might consider the Cross Entropy Method. Its convergence is probably better than genetic algorithms/ simulated annealing for some problems.

posted by blueyellow at 5:06 AM on April 9, 2006

Nothing wrong with regular old gradient descent, if done properly.

1 - use the solver that comes in Excel

2 - use FMinCon or one of the other minimization functions from MATLAB

3 - Something from NETLIB, MINPACK perhaps?

posted by sandking at 2:57 PM on April 9, 2006

One thing I'd like to mention is that your function is probably not particularly "nice". I'm assuming that the parameters you're perturbing describe the policy your agent is using. This can mean that a small change in parameters can have a large change in payoff -- so gradient estimates are going to be noisy.

I'd recommend taking a look at Nate Kohl and Peter Stone's paper "Machine Learning for Fast Quadrupedal Locomotion" [pdf]. They're looking at learning a policy for driving an AIBO, but it should generalize to what you're doing.

posted by agent at 3:43 PM on April 9, 2006

what you're describing sounds quite ambitious. anyway, a few more comments:

- if you're doing this repeatedly, consider starting subsequent searches from near your best result from previous searches.

- when you say that the parameters describe a strategy, that raises the possibility that some parameters are not continuous. this might be obvious, but classical minimisation techniques are not going to work for any parameters that choose from a set of distinct values (like selecting different algorithms) - that's a totally different problem (and much harder).

- in my experience, the biggest waste of time and effort (assuming continuous variables, smooth functions, etc) comes from having two variables that are strongly correlated (where adjusting one can "cancel out" the other, so you can "play one off against the other"). you can spot this by plotting your function as contour levels in a 2D plot for each pair of variables in turn. this takes some time, but is often worth the effort - correlated variables have contours that run along the "diagonals".

- if those plots don't show nice, simple concentric circular contours, but instead have lots of structure, then you've got problems. look at the fancier suggestions here (particularly simulated annealing).

- the markov chain stuff mentioned above sounds interesting (it's new to me).

posted by andrew cooke at 6:32 PM on April 9, 2006

Thanks a lot guys.

Many good points here [I don't want to paint the page white by marking every answer as 'best answer', so I restrict myself to simply thanking everybody]

I am playing with Simulated-Annealing, which (so far) seems very promising, or at least, much much better than what I had before with gradient descent.

MCMC also seems very cool, and will be my next attempt, should SA also fail me.

Side Question: Is MCMC basically similar to Simulated-Annealing, but with a bunch of particles instead of one, or did I completely misunderstand it?

Cheers.

posted by lenny70 at 8:52 PM on April 9, 2006

MCMC is for Bayesian estimation of the posterior distribution. I don't see its direct application here. It only makes sense if you are trying to sample from a particular distribution. It does not sound like your problem fits the bill.

Again, I think cross entropy is an effective approach for general combinatorial and non-convex optimization.

posted by blueyellow at 9:05 PM on April 9, 2006

Have you looked at genetic algorithms?

posted by weston at 5:24 PM on April 8, 2006

This thread is closed to new comments.