How can I best visualize the results of my combinatorial optimization problem?
October 16, 2008 5:07 AM

DataVisualizationFilter: How can I best visualize the results of my combinatorial optimization problem? I'm trying to solve an n-dimensional discrete optimization problem using some metaheuristic algorithms. More details behind the cut.

Imagine you have M resources (R1, R2, ... , Rm) and N tasks (T1, T2, ..., Tn). There are currently no constraints on which resources you can use for what task, or how many tasks a single resource can satisfy. Any given configuration will be evaluated on two objectives, cost and value. You're trying to minimize cost and maximize the value of your solution.

Sample solution:

R1: T1, T2, T3
R2: T4
R3: T5, T6
R4: T7, T8, T9
R5: T10

This solution yields a Cost of 0.2 and a Value of 0.8 (both on a 0-1 scale).

I can currently plot my solutions in objective space as a scatter plot, where each axis is one objective and each solution is an (x, y) point in that space. I'm currently comparing a random sampling of solutions against the solutions my metaheuristic produces.
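
For concreteness, here's roughly the kind of objective-space plot I mean, as a minimal matplotlib sketch (the scores here are made up, just standing in for real evaluated solutions):

```python
import random

import matplotlib.pyplot as plt

# Made-up (cost, value) scores standing in for real evaluated solutions.
random_solutions = [(random.random(), random.random()) for _ in range(200)]
metaheuristic_solutions = [(random.uniform(0.0, 0.4), random.uniform(0.6, 1.0))
                           for _ in range(50)]

# Each solution is a single (cost, value) point in objective space.
plt.scatter(*zip(*random_solutions), c="lightgray", label="random sampling")
plt.scatter(*zip(*metaheuristic_solutions), c="tab:red", label="metaheuristic")
plt.xlabel("cost (minimizing)")
plt.ylabel("value (maximizing)")
plt.legend()
plt.show()
```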

I'm interested in other ways to plot solutions, in the decision space. For example, if I do a random sampling of my possible solutions, do the good solutions share some criteria? If R4 performs T7, does it matter who satisfies T10, or will I end up in around the same region of objective space? Are there building blocks of solutions (R4 performs T7, R5 performs T1, T2, T3) that yield the best results? Is one solution brittle because all of its neighbors quickly drop off on the objectives, while another solution is robust because its neighbors are more stable?

Ideally, I'd set up an axis for each dimension of the solution (each resource or each task) and then plot each solution against those axes, but since the values are discrete, their ordering along an axis is meaningless. Furthermore, n axes are going to be mighty difficult to visualize.

I feel like what I'm trying to do must already be solved in data visualization or statistics. Is there some clustering algorithm I could use? Is there a way to reduce the dimensionality? Something like k-means or nearest neighbor? The closest thing to a Euclidean distance I can define between two solutions is something like a Hamming distance (the number of tasks assigned to different resources).
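
To make that distance concrete, here's roughly what I have in mind, treating a solution as a task-to-resource assignment (a rough sketch; the dict representation is just one possible encoding):

```python
def hamming_distance(solution_a, solution_b):
    """Count the tasks assigned to different resources in the two solutions.

    Each solution is a dict mapping task -> resource, e.g. {"T1": "R1", ...}.
    """
    assert solution_a.keys() == solution_b.keys()
    return sum(1 for task in solution_a if solution_a[task] != solution_b[task])

# Example: the sample solution above vs. one where T10 moves from R5 to R2.
a = {"T1": "R1", "T2": "R1", "T3": "R1", "T4": "R2", "T5": "R3",
     "T6": "R3", "T7": "R4", "T8": "R4", "T9": "R4", "T10": "R5"}
b = dict(a, T10="R2")
print(hamming_distance(a, b))  # -> 1
```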

Guidance would be much appreciated!
posted by miasma to Science & Nature (9 answers total)
 
Not a solution in the least, but more of a suggestion that you change your approach:

I find it strange that you are using a scatter plot as a way to solve this. Normally for optimization problems you use linear programming or nonlinear programming (http://en.wikipedia.org/wiki/Nonlinear_programming). Random sampling will never prove a solution optimal, after all. There are free programs you can download for doing LP and NLP.

If you are convinced you want to do this random sampling, you could always look for solutions that _dominate_ other solutions. By this I mean solutions that are at least as good on every evaluation criterion as another solution and strictly better on at least one; the dominated solutions are discarded, and you continue until you are left with only the non-dominated ones. If there is a boundary between dominated and non-dominated solutions, a really, really extensive random sampling might reveal it. On the other hand, this is a real shot in the dark.
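
A rough sketch of that dominance filtering, assuming each solution has already been reduced to a (cost, value) pair (the numbers here are made up):

```python
def non_dominated(solutions):
    """Keep only solutions that no other solution dominates.

    A solution dominates another if it has lower-or-equal cost and
    higher-or-equal value, and differs from it (so it is strictly
    better on at least one objective). Each solution is (cost, value).
    """
    front = []
    for s in solutions:
        dominated = any(
            o[0] <= s[0] and o[1] >= s[1] and o != s
            for o in solutions
        )
        if not dominated:
            front.append(s)
    return front

print(non_dominated([(0.2, 0.8), (0.3, 0.7), (0.1, 0.9), (0.25, 0.85)]))
# -> [(0.1, 0.9)], since that point dominates all the others
```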

I think that nonlinear programming is your best bet.
posted by FastGorilla at 7:16 AM on October 16, 2008


Seconding FastGorilla; the first thing I thought of was linear programming as well. I can't speak for the nonlinear approach. The result of LP should be a shape or surface representing all feasible solutions, and then it's just a matter of picking the vertex that represents the optimal combination of cost, value, or whatever your objective functions are.
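
For illustration only, here's a toy LP solved with scipy's linprog; the objective and constraint are invented and don't correspond to your actual cost and value functions, it just shows the "pick the best vertex of the feasible region" idea:

```python
from scipy.optimize import linprog

# Toy problem: maximize value = 3*x1 + 2*x2 (linprog minimizes, so negate),
# subject to a total budget x1 + x2 <= 4 and per-variable bounds.
c = [-3.0, -2.0]                  # negated objective coefficients
A_ub = [[1.0, 1.0]]               # x1 + x2 <= 4
b_ub = [4.0]
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 3), (0, 3)])
print(result.x, -result.fun)      # optimal vertex and its objective value
```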

Your comment about finding good classes of solutions also suggests genetic algorithms: letting good solutions "mate" and exchange elements of structure to see if that creates a great solution. The GA idea of schemas, or families of related solutions, corresponds to your desire to see whether a general class of solutions (such as R2 performing T3) regularly outperforms other solutions. If this is an LP type of problem, though, then I'd imagine GA is going to be a lot of work to set up for little payoff. In any case, good luck!
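
If you do try the GA route, the crossover step is roughly this (a toy sketch; the task-to-resource list is just one possible encoding of a solution):

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Swap the tails of two assignment lists at a random cut point.

    Each parent is a list where index i holds the resource assigned to
    task i, e.g. ["R1", "R1", "R1", "R2", ...].
    """
    cut = random.randrange(1, len(parent_a))
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

a = ["R1", "R1", "R1", "R2", "R3", "R3", "R4", "R4", "R4", "R5"]
b = ["R2", "R2", "R3", "R3", "R1", "R1", "R5", "R5", "R4", "R4"]
print(one_point_crossover(a, b))
```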
posted by sapere aude at 8:47 AM on October 16, 2008


Hmmm... doesn't it matter how the costs and values are arrived at? I would think the answers to your questions depend on whether it's just some black box you put a solution in and out pops a (Cost, Value), versus something where you see how each particular Resource-Task match-up contributes to the overall cost and value. For example:

If R4 performs T7, does it matter who satisfies T10 or will I get around the same region in objective space?

I don't see how this could be known without more details about how the costs and values are calculated.
posted by losvedir at 9:09 AM on October 16, 2008


Response by poster: Oh, I think I've confused some people.

I agree with you all about how to solve the optimization problem; LP and NLP are definitely viable approaches. In fact, the metaheuristic I'm using is a genetic algorithm-based approach.

The random sampling is just a way to compare this approach to other approaches.

What I'm really interested in is visualizing the results of these generate-and-test algorithms. I'd like to see what sorts of solutions have been generated in the decision space. The scatter plot gives me an idea of what things look like in objective space.
posted by miasma at 11:42 AM on October 16, 2008


Best answer: You might try applying Factor Analysis or the closely related Principal Component Analysis (PCA) to your decision space. You can specify each solution as a point in {1, 2, ..., m}^n. The idea of these two techniques is to take your set of solutions as a cloud of points in n-dimensional space, rotate it around, and project it onto a lower-dimensional space (say, two-dimensional) so that the lower-dimensional projection (or shadow) is somehow meaningful. Each axis in the lower-dimensional space would be a linear combination of the original dimensions (in your case, tasks). This may or may not have meaning for you, but these kinds of analyses are pretty standard.
posted by mhum at 5:49 PM on October 16, 2008


Response by poster: @mhum

I think this is more on the right track, thanks. Could you provide an example, perhaps? I'm still unclear as to how I can put discrete points on an axis; R1 can't really be related to R2 in terms of T1. I think I'm just still confused about how to apply Factor Analysis or PCA to this sort of problem.
posted by miasma at 8:09 AM on October 17, 2008


Best answer: On second thought, since your data is categorical, I'm not so sure that a straight-up PCA (in the formulation I proposed) would be the way to go here. Any assignment of these discrete variables (i.e.: resources) to an axis would be largely arbitrary. There appears to be some kind of version of PCA for categorical variables, but I don't know anything at all about it. There's also something called "joint correspondence analysis", but I know even less about that.

However, we might be able to do something slightly different. Instead of trying to visualize the solution space directly, perhaps we can try visualizing properties of the solution space. For example, for each solution we can form a vector (w_1, w_2, ..., w_m) where w_k is the number of tasks assigned to resource k (or maybe w_k is the total cost or total value associated with the tasks assigned to resource k). By formulating the solution space this way, we get vectors in R^m instead of points in some discrete space, and thus you can apply the usual PCA-type analyses directly. Granted, you lose the task-specific information (e.g.: does task 1 always get assigned to resource 3?), but you might be able to tease out some better information this way.
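
Concretely, something like this might work; a rough sketch assuming scikit-learn and a dict-of-assignments representation (the example solutions are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

resources = ["R1", "R2", "R3", "R4", "R5"]

def count_vector(solution):
    """Turn a task->resource dict into (w_1, ..., w_m): tasks per resource."""
    return [sum(1 for r in solution.values() if r == resource)
            for resource in resources]

# A handful of made-up solutions (task -> resource).
solutions = [
    {"T1": "R1", "T2": "R1", "T3": "R2", "T4": "R3", "T5": "R5"},
    {"T1": "R1", "T2": "R2", "T3": "R2", "T4": "R4", "T5": "R5"},
    {"T1": "R3", "T2": "R3", "T3": "R3", "T4": "R4", "T5": "R4"},
]

X = np.array([count_vector(s) for s in solutions])
projected = PCA(n_components=2).fit_transform(X)  # each row: a solution in 2-D
print(projected)
```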

In any case, if your problem is more about visualizing your decision space than about drawing analytic conclusions, there are other, less rigorous options. For example, let's say you wanted to compare some set of solutions to a single reference solution. Assign a color to each task according to which resource it was assigned to in the reference solution (maybe using different gradations of the same color for different tasks assigned to the same resource). Then, draw each solution as a sequence of color strips, each strip corresponding to a resource, each entry on the strip corresponding to a task. This will only work if you don't have unmanageably many resources. If this is unclear (which I'm pretty sure it is), I might be able to whip up an example.
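
Something along these lines, sketched with matplotlib (all of the data here is invented just to show the idea):

```python
import numpy as np
import matplotlib.pyplot as plt

resources = ["R1", "R2", "R3", "R4", "R5"]
tasks = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9", "T10"]

# Made-up reference solution and one solution to compare against it.
reference = {"T1": "R1", "T2": "R1", "T3": "R1", "T4": "R2", "T5": "R3",
             "T6": "R3", "T7": "R4", "T8": "R4", "T9": "R4", "T10": "R5"}
solution = dict(reference, T3="R2", T10="R2")

# One row per resource in `solution`; each occupied cell is colored by the
# resource that task had in the reference solution.
grid = np.full((len(resources), len(tasks)), np.nan)
for t, task in enumerate(tasks):
    r = resources.index(solution[task])
    grid[r, t] = resources.index(reference[task])

plt.imshow(grid, cmap="tab10", vmin=0, vmax=len(resources) - 1)
plt.yticks(range(len(resources)), resources)
plt.xticks(range(len(tasks)), tasks, rotation=45)
plt.title("solution colored by reference assignment")
plt.show()
```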

Another possibility for visualization is multidimensional scaling (MDS). The idea of this technique is this: given a set of things and a similarity function between any pair of things, we want to plot the set in two (or maybe three) dimensions so that the Euclidean distance between any two plotted points more or less corresponds to the similarity-function distance between the underlying things. In your case, the things are your solutions and the similarity function would be some kind of distance metric between partitions (a description of three such measures can be found here; it looks like the Rand index is the easiest). This kind of analysis can sometimes reveal clusters of things that wouldn't have been apparent otherwise. I know free software exists (or existed) to do this because I've downloaded it before, but I'll be damned if I can find it again.
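
As a rough sketch of how that could look with scikit-learn's MDS, using a simple Rand-index-style distance between assignments (the example solutions are made up):

```python
from itertools import combinations

import numpy as np
from sklearn.manifold import MDS

def rand_distance(sol_a, sol_b):
    """1 - Rand index between two task->resource assignments seen as partitions."""
    tasks = list(sol_a)
    pairs = list(combinations(tasks, 2))
    agree = 0
    for t1, t2 in pairs:
        same_a = sol_a[t1] == sol_a[t2]
        same_b = sol_b[t1] == sol_b[t2]
        agree += same_a == same_b   # both together or both apart -> agreement
    return 1.0 - agree / len(pairs)

# A few made-up solutions (task -> resource).
solutions = [
    {"T1": "R1", "T2": "R1", "T3": "R2", "T4": "R3"},
    {"T1": "R1", "T2": "R2", "T3": "R2", "T4": "R3"},
    {"T1": "R3", "T2": "R3", "T3": "R1", "T4": "R1"},
]

# Pairwise dissimilarity matrix, then embed into 2-D for plotting.
n = len(solutions)
D = np.array([[rand_distance(solutions[i], solutions[j]) for j in range(n)]
              for i in range(n)])
coords = MDS(n_components=2, dissimilarity="precomputed").fit_transform(D)
print(coords)
```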
posted by mhum at 11:25 AM on October 17, 2008


You might want to check out GGobi. It looks like it has a bunch of tools for exploring high-dimensional data sets.
posted by mhum at 12:49 PM on October 17, 2008

