SQL Combinations
August 15, 2007 3:38 AM

Any SQL / math ninja's on the green fancy a quick challenge..? So I've got this table containing number pairs, right, and I need to compute all the possible combinations.

For example: I have the following pairs: {1,2}, {1,3}, {3,4}, {2,5}

The implementation I need will produce the following combinations: {1,2}, {1,3}, {3,4}, {2, 5}, {1,3,4}, {1,2,3}, {1,2,5}, {1,2,3,5}, {1,2,3,4}, {1,2,3,4,5}

But not these: {1,4}, {1,5}, {3,5}, {1,2,4}, {1,3,5}

I know, I know, sounds like homework filter but I can assure you it's not - I'm building a shopping cart app thing on LAMP and I can't quite see a clear way around this conundrum. I mean, it could be done recursively in PHP but thats going to involve a) lots of connections / queries to the database, and b) the kind of headaches that only recursion can induce.

Can you see a way of doing this in SQL? Or even an algo which won't hammer the DB server with zillions of l'il queries?
posted by whoojemaflip to Computers & Internet (25 answers total) 3 users marked this as a favorite
Uh... you're looking for numbers that can be chained together? You might have to be a bit clearer with your description.

How is (1,2,3,4,5) valid?

I think the answer will be that you're not storing you data correctly, though.
posted by Leon at 4:32 AM on August 15, 2007


um, yeah I was a bit vague I guess.

The way this looks in my head is a kind of recursive inner join, which makes no sense I know.

For example: {1,2,3} is valid because the '1' in {1,2} matches the '1' in {1,3} and reducing that set to unique numbers makes {1,2,3}

{1,2,3,4,5} is valid as you can link {1,2},{1,3},{3,4} and {2,5} in the same way.

{1,2,4} is not valid because there is no link between 1, 2 and 4 that doesn't involve a number from another pair.

I know it sounds (more than a little bit) mad to do it this way but if I start with an arbitary number of the initial pair sets then the number of resulting sets is potentially huge - certainly too big to enumerate by hand.

If it helps you to understand the problem, the pair sets that I start off with represent software products that work together, and the resulting combinations will be of groups of software products that integrate.
posted by whoojemaflip at 4:50 AM on August 15, 2007


Sounds like a graph where you are taking all the directly connected subsets. Edges in the graph are defined by your original pairs.
posted by DarkForest at 5:11 AM on August 15, 2007


your problem is related to graphs - you can think of your numbers as labelling nodes in a graph and the pairs as indicating that there is a line joining two nodes. then what you are asking for is all the distinct sub-graphs (or something closely related).

the bad news is that SQL is notoriously bad at handling these kinds of data (at least, the implementations in free SQL implementations) - there is no simple ("one query") solution.

of course, you can do this with several queries. the best way to start (if no-one gives an answer here) would be to read up on algorithms related to graphs. the most efficient solution will probably depend on the number of nodes you have, so it might be worth saying approximately how many different products there are.

assuming it's fairly small (less than millions) i would guess that it's easiest to read the data into memory, construct a tree in memory (the best representation will depend on how sparse the graph is - do most products work together, or does each only work with one or two others?), and finally traverse the tree in some way to generate the output.
posted by andrew cooke at 5:19 AM on August 15, 2007


[sorry, DarkForest, didn't preview]
posted by andrew cooke at 5:20 AM on August 15, 2007


Ok, I think I see what you're doing. A works with B and C, B works with D and E, therefore A works with B, C, D and E?

You're doing too much work in the SELECT. Make the INSERT phase more complex, and store some redundant data. You only do the INSERT once, so you can afford for it to be heavy. The SELECTs happen thousands of times, so you want them to be light.

A table structure like this:

product(productid)
workswith(productid1,productid2)

Would let you do a very simple SELECT:

SELECT productid2 FROM workswith WHERE productid1 = 1;

The easiest way to discover all the new workswith pairs, IMO, is to keep throwing an IN statement at the database, Pseudocode:
$products = (1,2,3)
WHILE (LEN($products) is still growing)
    $newproducts = "SELECT productid2 FROM workswith WHERE productid1 IN ($products)"
    $products = DISTINCT ($products + $newproducts)
(on preview, yes, the structure I'm suggesting is basically a directed graph, I'm just suggesting doing as much computation as possible on INSERT to save time on the SELECTs).
posted by Leon at 5:30 AM on August 15, 2007


Hmm are you absolutely sure you're modelling it effectively? Just because software product A works with B, and B works with C, does it always follow that A works with C? What if A and C do the same thing and would clash? Are pairs the best option, or would broader groups/tagging be better in terms of flexibility and performance?
(Yeah I know, more questions rather than an answer, but it does sound a bit odd for a cart system!)
posted by malevolent at 5:53 AM on August 15, 2007


this article will probably help. celko (the author) is good.
posted by andrew cooke at 6:03 AM on August 15, 2007


To clarify the model: If we have {A,B} and {B,C} then {A,B,C) will work but not {A,C}.

Recursive programming makes me think of Morpheus saying '... and I'll show you how deep this rabbit-hole goes'. I'm hoping for a 'blue pill' solution where the story ends before I find myself surrounded by mutant robotic datastructures that eat all my server's ram.

I like the Celko method, but can't help wondering if it will take me all day to wrap my head around his nifty SQL reverse function. I can be a bit limited you see :)

I really like the idea of de-normalising the tree structure to save having to do this calculation on each page refresh.

I'm only going to use a subset of the graph each time, depending on the contents of the shopping basket, which pages are viewed.

Feeling slightly less clueless, but not on top of it yet. Thanks for y'alls help so far :D
posted by whoojemaflip at 6:56 AM on August 15, 2007


Trying to do this in SQL sounds painful. Is there any way you could dump the data, once it's retrieved, into some procedural language and build the graph and process it there?
posted by DarkForest at 7:08 AM on August 15, 2007


Another example:

I have in the DB, a table of products(PID)
and a table of product map pairs (PID , MAP_PID).

I need to calculate the subset of product groups (as described above) that are relevant to a given set of products.

E.g. Given the mappings {A,B}, {A,C}, {C,D}, if I have product {A} in the basket then I want to know that {B}, {C}, {C,D} and {B,C,D} are all relevant groups of products.

Yeah - it's definitely a bit odd for a cart system, no doubts about that! If you can see a better way of doing this, I'm all ears.
posted by whoojemaflip at 7:11 AM on August 15, 2007


don't suspect this is for you, but might be of general interest...

not all database languages are restricted in this way. there's a language called datalog in which this kind of query is simple. and, luckily, there's also an implementation that uses a sql database as the "backend". so you can store your data in the sql database but use datalog queries for graph-related problems. i used this approach once, my notes are here.
posted by andrew cooke at 7:20 AM on August 15, 2007


if you don't have too many products, and you don't ever want to delete products (maybe that is possible - i haven't thought how) then i would suggest having an association table that gives all the results, and building it on insert. this is something like Leon's approach, but uses SQL more.

so it's basically another table, similar to the (PID, MAP_PID), but populated so that (A, D) is in there too (from your example above where A is connected to D only indirectly).

to build that table you would use a stack. say you are adding X. on that stack you would push any items directly connected to X. then you would repeatedly, until the stack is empty:
- pop the head of the stack (call it H)
- add (X, H) to the "complete" association table
- for each item I connected to H in the table:
-- check that there is not already a link (X, I)
-- if no link already, push I onto the stack

but deleting is a problem. if you delete products infrequently the simplest approach is probably to delete this table completely and then rebuild it after deleting the product.
posted by andrew cooke at 7:29 AM on August 15, 2007


Here's a naive approach at it (programmaticaly):

1. Construct a node for every valid member.
2. Construct a lookup table that points to all those nodes.
3. For every valid pairing of members, lookup the nodes and add a pointer (edge) to each other.

4. a) Optimise space: to determine if {1,2,3} is valid, look at each member and check for a pointer to another member in that set. If there is a member that has no pointer to at least one other member, then it is not a valid set.

4. b) Optimise time: use a traversal algorithm on the construct to find and store all subgraphs. To determine if {1,2,3} is valid, just see if its one of the calculated subgraphs.

Or you can use a combination of both: only calcualte subgraphs of size n or less and just perform real-time determination on those large and hopefully infrequent sets.
posted by tksh at 7:29 AM on August 15, 2007


I'm not sure, but it looks like you're almost looking for subgroups of the permutation group formed by a given set of elements. That is, treat your pairs as 2-cycles, so {1,2}=(12) means that 1->2 and 2->1.

In this example, you also have (13). If you want to form a group with these elements, you have to ensure closure (you also need an identity, which does nothing). That is, you have to include the products of all elements in the subgroup, for example (13)(12)=(123). (In case you're unfamiliar with the notation, read (13)(12) from right to left. E.g, acting on 1, 1->2 from the (12), but then it stays there, because (13) does nothing to 2, so overall 1->2. Acting on 2, (12) sends 2->1, and then (13) sends 1->3, so 2->3. Similarly, 3->1, so (13)(12)=(123) which reads, 1->2, 2->3, 3->1.) Note, however, that you also have (12)(13)=(321), and (12)(13)(12)=(23).

From this point of view, I'm not sure why you would exclude, for example, (14)=(34)(13)(34). I think mostly I'm not sure I understand your rules, or I don't really see a way to codify them self-consistently. Maybe you want to limit yourself to even permutations or to products where no element appears twice in the product.

Anyhow, one easy way to deal with these is to treat each 2-cycle as a matrix, where the cycle (ij) becomes a matrix with the (i,j) and (j,i) entries equal to 1, all (k,k) entries where k != i and k != j equal to 1, and all other entries zero. Then multiply the matrices as usual to get a new one. In this way, all you would need is: a) a routine to turn a cycle into a matrix, b) a very simple matrix multiplication routine, and c) the inverse of a.

In short, this looks to me like a group theory problem.
posted by dsword at 7:46 AM on August 15, 2007


This article might help. It taught me how to represent trees in SQL.

The bit about products only working in conjunction (A only works with C if B is present) makes the problem more complex than I thought. I might resort to working on paper until I had a model that captures everything I need.
posted by Leon at 8:28 AM on August 15, 2007


Here's how I understood the problem:
You have several base combinations k0, k1, …, kn
A derived combination d is possible if d is the union of one or more of the ki.
You want to find all the possible derived combinations.

Here's one method to do it:
* start with d0 = {k0, k1, ..., kn}
* form di+1 by finding the union of each kj with each element of di, plus the elements from di
* when di+1 and di are identical, you've found all the combinations that are possible.

Here's a Python program that does this:
def form_all_combinations(k):
   k = [frozenset(ki) for ki in k]
   d = set(k)
   while 1:
       l0 = len(d)
       for di in list(d):
           for kj in k:
               d.add(di|kj)
       if l0 == len(d): break # No items added by this iteration
   return sorted(list(di) for di in d)

k = (1,2), (1,3), (3,4), (2,5)
for it in form_all_combinations(k): print it
It prints:
[1, 2]
[1, 2, 3]
[1, 2, 3, 4]
[1, 2, 3, 4, 5]
[1, 2, 3, 5]
[1, 2, 5]
[1, 3]
[1, 3, 4]
[2, 3, 4, 5]
[2, 5]
[3, 4]
Python 2.4 and above have 'set' as a built-in type making this a fairly easy problem to solve. If you're using another P-language in your LAMP it might not be quite so easy, but the algorithm will be the same.

Hm, I just noticed that my program produces [2, 3, 4, 5] while your example neither includes it nor excludes it. If the new combination "di | kj" should only be acceptable when di and kj already have one item in common, then change the line 'd.add(...)' to 'if di & kj: d.add(di|kj)'.
posted by jepler at 9:09 AM on August 15, 2007


Thanks to everyone who took the time to chime in... so many different ways of doing it. I solved it using an iterative deepening depth-first search. I found that it helped massively to visualise the data as vertices and edges.

For anyone whose interested - a solution in 31 lines of quickly hacked php:

foreach($nodes as $node) {
$solutions = limitedDepthSearch(array($node), $edges, count($nodes), $solutions);
}

function limitedDepthSearch($nodes, $edges, $depth, $solutions) {
$depth --;
if($depth == 0) return $solutions; // termination condition

foreach($nodes as $node) {
$connectedNodes = getConnectedNodes($node, $edges);
if($connectedNodes) {
foreach($connectedNodes as $cn) {
if(!in_array($cn, $nodes) && !array_key_exists($cn, $nodes)) {
$tmp = $nodes;
$tmp[] = $cn;
$solutions = array_merge($solutions, array($tmp));
$tmp2 = limitedDepthSearch($tmp, $edges, $depth, array());
if($tmp2) {
if(is_array($tmp2[0])) $solutions = array_merge($solutions, $tmp2);
else $solutions = array_merge($solutions, array($tmp2));
} } } } }
return $solutions;
}

function getConnectedNodes($node, $edges) {
foreach($edges as $edge) {
if($edge[0] == $node) $output[] = $edge[1];
else if($edge[1] == $node) $output[] = $edge[0];
}
if($output) return $output;
else return false;
}
posted by whoojemaflip at 9:12 AM on August 15, 2007


damn metafilter ate my indentation.


jepler: thats not quite it as {2,3,4,5} isn't possible. My neater than mine tho ;)
posted by whoojemaflip at 9:17 AM on August 15, 2007


that should read ^much^ neater than mine. Bonus points if anyone can tell me how to indent comments on aski-mefi. ;)
posted by whoojemaflip at 9:19 AM on August 15, 2007


Use <blockquote> to indent a whole region of code. Inside the code, replace two spaces with one space and one &nbsp;. (If you used tabs, first replace tabs by 8 spaces)

Another question about your desired output: If you have (1,2), (1,3), (1,4) can you get (1,2,3,4)? My program will give that as an answer, but your "chaining" explanation seems to exclude it.
posted by jepler at 9:34 AM on August 15, 2007


Thanks to Darkforest for showing me the way.

I'd still love to see an SQL solution though, now I know it can be done! Andrew cooke gets an award for the link to an article on recursive SQL - who'da thunk it?
posted by whoojemaflip at 9:40 AM on August 15, 2007


cheers jepler.

{1,2,3,4} is correct as {1,2}, {1,3}, {1,4} all have '1' in common, so although e.g. {2,3} and {3,4} aren't right, that example works.

As Leon said, its a touch subtler than simply listing the combinations of group of sets.
posted by whoojemaflip at 9:49 AM on August 15, 2007


Inside the code, replace two spaces with one space and one  

That's doing it the hard way. It's a lot easier to use <PRE>. The only drawback is that you need to use <br> instead of a newline otherwise it will be doublespaced. Or use pastebin. There was just a metatalk about this, so see that thread for more details.
posted by Rhomboid at 4:15 PM on August 15, 2007


rhomboid, one advantage (maybe) of using &nbsp; instead of <pre&gt is that the &nbsp;'d code will wrap on a narrow screen. Perhaps I should survey that metatalk thread you alluded to..
posted by jepler at 7:57 PM on August 15, 2007


« Older Macbook hard drive recovery   |   Where and how can I purchase WWD Accessories Fall... Newer »
This thread is closed to new comments.