Algorithms
September 27, 2004 9:01 AM   Subscribe

Programming Filter: I need some advice on a tree-building algorithm. (Tree as in the "+" and the tree branches out, like in Windows Explorer.) The language is PHP, but since I'm looking for an algorithm rather than actual code, any programmer with mad skills should be able to help. More inside.

I have a list of nodes in a MySQL database. Each node knows it's parent node in the tree. Each node can only have one parent node. First-level nodes have a parent node of null. The database also stores a list of nodes that are expanded for the particular user, in a serialized array in this case.

A typical tree/branch recordset looks like this:
id,short_descr,childof
1,"Trunk 1",NULL
2,"Branch 1",1
3,"Branch 2",1
4,"Twig 1",2

and in this test case, "1" is expanded but none others are. I need to output the following display:
- Trunk 1
    + Branch 1
    # Branch 2

What I need is an efficient algorithm to do this without hammering the database a billion times. I can easily do it the dirty way, but that way almost every row hits the database to check for its children... it's worse if the tree's branched at all. I'm not sure how to get it down to quick and easy, though.

Caveats:
1) I'm not going to change the recordset so that the parents keep track of their children; it's too easy to corrupt the data that way if an update gets missed or something breaks in a code change.
2) I'd prefer not to replace "Database" with "array", and manipulate one huge array. That seems to me like an inefficient way to do it. My goal isnt' to manipulate data, my goal is to output HTML.
3) While DHTML/Javascript seems to be an ideal solution, it's far too complex and breaks too easily cross-platform without a lot of hacking. I'm sure there are great JS libraries that do this without requiring the page to reload every time, but they won't work for this specific application due to a number of other conflicts.

Please try to keep it simple; small words. ;) I'm a relatively inexperienced and uneducated coder, although I have been hacking PHP for quite a long time and I'm extremely proficient with the language. I'm sure this is easy, I'm just not that good with the logic.

Oh, and while we're in here, any suggestions for good PHP Unit testing, and PHP-specific advice on unit testing? Or error-handling libraries?
posted by SpecialK to Computers & Internet (16 answers total)
 
If I understand correctly, you seem to have a fundamental mismatch between the data structure (your database layout) and desired representation (the tree). The best method for finding the children of a parent node appears to require searching through all nodes (i.e. "hammering the database a billion times"). No efficient algorithm exists.

Or maybe I'm completely missing the point of the question...
posted by Galvatron at 9:27 AM on September 27, 2004


trees with sql is a thorny topic. google for "celko".
posted by andrew cooke at 9:56 AM on September 27, 2004


It'd help to see the database schema, but if it looks like I think it does, you just need to join the table on itself (as many times as you have levels of ancestry, I think). I don't know what you would use besides JS + HTML to do the exploding tree in a browser. Java applet, I guess.
posted by yerfatma at 9:58 AM on September 27, 2004


I don't know what you would use besides JS + HTML to do the exploding tree in a browser. Java applet, I guess.


You don't need JS, you can do it all in php. But you'll have to reload the page whenever you expand a branch. I find that kinda ugly, but then so's javascript.
posted by jpoulos at 10:15 AM on September 27, 2004


I can't help much with an algorithm, but HTML_TreeMenuXL is a wonderful PHP component for generating dynamic tree menus. Highly recommended.
posted by waxpancake at 10:17 AM on September 27, 2004


Reconsider caveat 3 - DHTML can help a tree work really well, i.e., prevent browser refreshes, and the need to requery over and over again for the same data (unles you get into caching).

It gives you better speed because you are not sending the same HTML over and over to the browser. By using DHTML you can easily display trees with tens of thousands of nodes. For an elegant example of this examine the way Microsoft's MSDN Library tree works.

Your recordset structure is fine, no need to change. There are lots of other ways, but in reality that should work well.
posted by SNACKeR at 10:35 AM on September 27, 2004


If you've got a list of expanded node numbers in your session variables, you can set up the query to check for those nodes' children in one query (using something like SELECT * FROM mytable WHERE childof IS NULL OR childof IN (x, y, z) order by childof; where 'x,y,z' is a comma-separated list of the expanded node numbers). Then you just use a sorting/indenting loop to bring them into place.

If you want to figure out whether something's got children (to decide what icon to put next to it), you can do a query like SELECT childof as nodes_with_children, count(id) as number_of_children FROM mytable GROUP BY childof; to see how many children each node's got. You can drop the count(id) section if you don't need to know how many and just test for membership (childless items will not show up in the query).

OTOH, a lot of this stuff can be solved by just selecting the whole table and figuring out what you need in PHP. If you're going to have less than, say, a thousand nodes, you're unlikely to notice much speed difference. Especially now that most databases support query caching.
posted by boaz at 10:48 AM on September 27, 2004


What I need is an efficient algorithm to do this without hammering the database a billion times.

The easiest way would be to grab all the data from the database (ugly, but actually far more efficient in the long AND short run), then run a recursive loop to structure it into an array in Javascript. You can then play with the child nodes as you wish -- there are a lot of already-made solutions around.
posted by Civil_Disobedient at 10:50 AM on September 27, 2004


Depending on this app, you might want to think twice about relying on JS to provide critical functionality.

The most efficient way I've found is to have a cached version of the tree data held in a multidemensional array, saved to a text file by php. Make sure that the cache is rebuilt (by looping through the necessary SQLs) each time there is an update to the tree table. It's a bit of a hack, but I prefer to call it an "optimization".
posted by 4easypayments at 10:53 AM on September 27, 2004


Ok, self-help filter strikes again. I guess I just needed some time to mull it over.

(But first, some answers:
yerfatma: The database schema is: id int(8) auto_increment, short_descr varchar(80), childof int(8) ... very simple. That's all I should need.
On preview, Boaz: Thanks, that was part of exactly what I'm looking for... and you're right, it's easier to just keep the data in it's raw form in PHP. What I'd started writing last night and where I got lost was in this giant pre-branched array that I was trying to figure out how to generate, and then display...
Re: Javascript: Sure, it's dynamic and cool. But I can't use it, as a) some of the other JS on the page (menuing, etc.) has some gnarly conflicts with various types of DHTML, and b) Some of the people that will be using this app don't necessarily use DHTML-enabled browsers (handhelds, etc.) and it would be astoundingly harmful to the UI for these users to be forced to view the tree as 100% expanded. I just want to avoid it all together, and the page weight is tiny so it's a pretty fast load, as long as the database latency isn't too bad. )

There is a way to do it in a single query. See if you follow me.

30k foot level:
1) Search through the table and figure out which nodes have kids. Throw the parents in an array. This reduces our overhead considerably.
2) Figure out which nodes are open. Again, this reduces our overhead considerably, because due to the layout of this screen it's going to be the rare user that has more than one or two trees expanded. (This data is already globally available from the user record without an extra query being run.)
3) Since we're interested in display and not building a large branched data structure, call a display function. This display function starts at the top... it grabs the list of "Trunk" level nodes from that first query, and then it starts going down the list.
Here are the different states and actions.
If No Kids (Open status is inconsequential): Display an empty box and the short description.
If Kids, Not Open: Display a + box and the short description
If Kids, Open: Display a minus box, and the short description and recurse.

The recurse part is where it gets interesting. The function that handles the branching is going to have to be able to call itself. So it passes itself the ID that it wants to recurse from, the data from that first query, and the level of indent. It runs through the same exact procedure above, and each level recurses until it hits all nodes that have no kids, in which point it'll return. But the whole time, it's printing data onto the screen.

Voila. No fuss, no muss, just some brain sweat. A basic branching algorithm... it's figuring out the recursion part that's tough.
posted by SpecialK at 10:54 AM on September 27, 2004


Let MySQL do the work for you.

If you index your table by the 'childof' column and use SQL to retrieve the data, it will not scan every row in the table, it will essentially "order" all the rows by childof and make lookups very fast.
posted by vacapinta at 10:55 AM on September 27, 2004


Oh, and if you still think I'm being pigheaded about the javascript, try telling that to the guy who's paying me to do it, who uses an older Palm handheld that he has tricked out with ten times its purchase price in accessories of various types and is completely unwilling to give up for a newer handheld with a newer browser. He will be accessing this page via that handheld, and it'll already be tough enough to negotiate the app due to the format of the screen and the hard time I have getting a good menu to work.

On preview again: Yeah, I'm indexed, vacapinta. But there's still a small amount of latency in each query that runs, no matter how fast, and I'd prefer to keep that down to a decent minimum ... it could get painful, not to mention messy, quickly if I'm querying more than fifty or sixty times just for one page view and it's going to be querying over and over again as people access that page over and over to expand and collapse nodes.
posted by SpecialK at 11:00 AM on September 27, 2004


Doesn't MySQL 5 have stored procedures? That's how this is usually done (except in Oracle and DB2 which have better ways to do this). Here's a thread on how to do this with a stored procedure in Postgres, which may translate to MySQL (or you could switch to Postgres, it's lovely).

And here's a good paper on implementing hierarchies in SQL Server which covers the Celko method andrew cooke alluded to above - MySQL should be able to do something like it.
posted by nicwolff at 2:10 PM on September 27, 2004


If the underlying data changes infrequently, you could look into caching the output variations to insulate yourself from DB queries. Cache Lite is pretty easy to work with.
posted by sad_otter at 2:52 PM on September 27, 2004


fyi, the celko thing is page 2 of that link of nicwolff's (thanks, nice reference).
posted by andrew cooke at 4:07 PM on September 27, 2004


The tricky part about doing it in a single query is getting the sibling nodes in correct alphabetical order. A different schema design can make this part a lot easier, one where you store the path (hierarchy) in a separate field.
posted by SNACKeR at 5:18 PM on September 27, 2004


« Older Name That Tune   |   Where should I volunteer? Newer »
This thread is closed to new comments.