Algorithm performance and throttling in PHP/Apache
May 13, 2006 4:34 AM

Efficient string array comparisons in PHP, and a question about Apache and PHP performance...

I have two arrays of same-length strings that I would like to compare. I have another two arrays that store probabilities ("scores") calculated for those strings. If the two strings match, I need to calculate another value from their respective probabilities. Here's the pseudocode:

run through all values of the string1-array
{
  run through all values of the string2-array
  {
    if string1 matches string2
    {
      score[string1-index] = function(string1's probability, string2's probability);
      remove string2 from string2-array;
      remove string2 probability from string2-probability-array;
      break;
    }
  }
  remove string1 from string1-array;
  remove string1 probability from string1-probability-array;
}


I am working with DNA strings, essentially arrays of up to 4^n values for strings of length n. For example, at length 7 the worst case has me comparing up to 16,384 string1 entries against up to 16,384 string2 entries.

The problem is that my PHP script does not complete this comparison when the string length is 7 or longer. I either get a blank web page after about 15-20 minutes of work, or an error message saying the network connection was broken.

First question: Is there a way to implement this comparison more efficiently, or to adjust my PHP 5 or Apache 1.3 configuration to log exactly where this code gets stuck, or perhaps to allocate more resources to the comparison?

I added the "removal" code after a match to try to shrink the comparison space on later rounds of the loop. With or without it, the script takes the same time, give or take a percentage point, for a given length n, probably because of the overhead of resizing the arrays. Comparisons of length 6 and shorter generally take a few minutes or less; at length 7 I seem to hit a resource wall that drives the CPU to nearly 100% for a long time.

A related problem is that my web server slows to a halt while the httpd process appears to allocate 95%+ of the CPU to this comparison task. I cannot open another session to the PHP application until the current comparison is finished.

Second, related question: Is there anything I can adjust in PHP 5 or Apache 1.3 to throttle PHP's CPU usage, or any way to code this comparison, so that other users can connect to the website and use the application while this loop does its work?
posted by Mr. Six to Computers & Internet (16 answers total) 2 users marked this as a favorite
 
I think you might find array_diff helpful instead of looping through the variables. It returns a new array containing the values from one array that are not present in the other.
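
For illustration, a minimal sketch of what array_diff returns, with array_intersect as the match-finding counterpart (the arrays here are made up):

$a = array('ATAG', 'GGCT', 'TTTA');
$b = array('GGCT', 'ATAG', 'CCCC');

print_r(array_diff($a, $b));       // values of $a not in $b: Array ( [2] => TTTA )
print_r(array_intersect($a, $b));  // the matches instead:    Array ( [0] => ATAG [1] => GGCT )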
posted by Civil_Disobedient at 6:03 AM on May 13, 2006


For optimization, you could check out the Zend Optimizer (the second recommendation for Zend software today! In no way affiliated).

As for the arrays, could you throw us a sample data set? Also, the function within the IF control structure in the second loop could be your bottleneck. In the past I've combined the array_walk function with some regular expressions to handle large string arrays with some success, though a lot of that was run from the command line rather than from a web page, so it may not be the best comparison.

I've +fave'd the question and I'll check back tomorrow. I think I have an example kicking around somewhere, but I'm heading out to a wedding right now and don't have the time to look for it.
posted by purephase at 6:32 AM on May 13, 2006


Be sure to tail your logs and turn on error_reporting(E_ALL) to catch everything. Do you have max_execution_time and memory_limit upped in your php.ini? Based on the combinatorics you're talking about, I suspect that's what's causing the issue.
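
For example (these are just runtime equivalents of those php.ini settings; the values are placeholders, not recommendations):

error_reporting(E_ALL);
ini_set('display_errors', '1');
ini_set('max_execution_time', '3600');   // seconds
ini_set('memory_limit', '256M');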

array_diff looks like it might help. If that doesn't work, I would run an asort on both arrays, then put the data into a tree or skip list, which will cut your searches down to O(log n).

It's late, um early, and so I may be reading what you're trying to do completely wrong, but hopefully at least the first paragraph helps.

Since it sounds like you have shell access, have you thought about running your PHP via PHP-CLI? That way you don't have to worry about the Apache overhead.

Either way, you should be able to renice the process that's taking up the CPU (effectively throttling by lowering priority).

Reference:
* PHP options: http://us3.php.net/info
* PHP CLI: http://us3.php.net/features.commandline
* Renicing processes: http://www.linux-tutorial.info/modules.php?name=Tutorial&pageid=85

Sorts and trees aren't really worth talking about, but skip lists are actually kinda fun:
* http://www.csee.umbc.edu/courses/undergraduate/341/fall01/Lectures/SkipLists/skip_lists/skip_lists.html
* http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html
posted by lhl at 6:44 AM on May 13, 2006


1) Put the data into a database where you can index it and remove the need to loop through the same data over and over again.

2) If that's not possible, flip the arrays so that the strings you're searching for are the keys rather than the values, which likewise indexes the data and removes the need to loop through it over and over again (see the sketch after this list).

3) Run it from the command line to take Apache out of the process.
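
A minimal sketch of the key-flipping idea in (2), with made-up variable names:

$lookup = array_flip($string2_array);   // string => its original index

// Membership test and index retrieval become constant-time lookups.
if (isset($lookup['ATAG'])) {
    $second_index = $lookup['ATAG'];
}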
posted by scottreynen at 7:12 AM on May 13, 2006


Scott, I like your train of thought. Since you're just doing a difference, put it all into mysql, index, and then do a query (SELECT a FROM tablea) MINUS (SELECT b FROM tableb).

Duh, along those lines: since you're dealing with data that can be mapped onto two bit vectors, you can trivially do set operations on them with any number of libraries: http://search.cpan.org/~stbey/Bit-Vector-6.4/lib/Bit/Vector/Overload.pod
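
A rough PHP sketch of that mapping, purely for illustration (the Perl module above does this properly; the names here are invented):

$code = array('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3);

function dna_to_index($s, $code) {
    // Two bits per base, so a length-7 string maps to 0 .. 4^7 - 1.
    $index = 0;
    for ($i = 0; $i < strlen($s); $i++) {
        $index = $index * 4 + $code[$s[$i]];
    }
    return $index;
}

// Flag which strings are present in the second set...
$present = array();
foreach ($string2_array as $s) {
    $present[dna_to_index($s, $code)] = true;
}

// ...then membership is a single lookup instead of an inner loop.
$match = isset($present[dna_to_index('ATAGGCT', $code)]);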

BTW, Mr. Six, out of curiosity, why are you using PHP on Apache? Most people I know who work in bioinformatics use either Perl or (more and more) Python. Both seem better suited (more efficient, easily daemonized, much more robust math libraries).
posted by lhl at 7:44 AM on May 13, 2006


Response by poster: Let's say I have the following array objects:

$first_strings = $_SESSION['profiles'][$first_profile_index]['strings'];
$first_string_probabilities = $_SESSION['profiles'][$first_profile_index]['probabilities'];

$second_strings = $_SESSION['profiles'][$second_profile_index]['strings'];
$second_string_probabilities = $_SESSION['profiles'][$second_profile_index]['probabilities'];


Example:

first_strings = Array ( [0] => "ATAG", [1] => "GGCT", ... )

All the strings within $first_strings are unique, no duplicates. Same for $second_strings. However, I have no guarantee that all the unique strings in $first_strings are in $second_strings, and vice versa.

My function requires knowing the indices of the first and second strings:

if (both strings match)
{
    $H01 += $first_string_prob[$first_index] * log( $first_string_prob[$first_index] / $second_string_prob[$second_index] );
    $H10 += $second_string_prob[$second_index] * log( $second_string_prob[$second_index] / $first_string_prob[$first_index] );
}


I don't think array_diff helps here because it doesn't give me the index of the strings where there is a match. All it tells me is what isn't in one array.

I need to run this within a web application because I pass the data from page to page in a $_SESSION variable.

I'd like to avoid having to set up a database, although if that really would solve the problem I'll put effort in that direction.

I set my php.ini resource limits as follows:

max_execution_time = 30000
max_input_time = 60000
memory_limit = 128M


My error reporting is already set to:

error_reporting = E_ALL

Is there a programmatic way to create a key=>value array from my arrays above, such that the key would be the string and the value would be its probability?

I know how to do something like $testarray = array("ATTAG" => "0.004", "CCCTA" => "0.0032", ...) but I obviously can't write that out for 16K+ records.

Since I don't know which httpd process will run at 95-100% I don't see how I can know ahead of time which process to renice. While I can renice during development since I'm the only one running the app, I can't do that when the application is running full-time.
posted by Mr. Six at 7:44 AM on May 13, 2006


Response by poster: I'm using PHP because my Perl and Python are really weak and I need to write code quickly. I wish I had the time to learn how to embed these languages into a web app but I don't have that luxury at the moment.
posted by Mr. Six at 7:48 AM on May 13, 2006


Just realized you want matches, not differences. Just switch what I said about differences with intersections.

Also, if your problem is just that you're losing data, you could checkpoint your data every x cycles by writing it to disk (or write out your array pointers every cycle). If you can use one of the set techniques, though, it'd be much easier.

Even in PHP, which doesn't have a real vector class, you can do bit-wise operations on 4^16 (32-bit) values...
posted by lhl at 7:53 AM on May 13, 2006


Best answer: So, as I understand it, right now your array is of the form:

$dna_array[$i] = 'GATTACA';
$probability_array[$i] = '0.01';

So just do a loop:

foreach ($dna_array as $k => $v) {
    $probability_hash[$v] = $probability_array[$k];   // key by the DNA string itself
}

Do that for your other DNA/probability pair. Now that you can look up a probability by its DNA code, you can easily run any sort of intersection/set matching and then retrieve the probability for each matching code.
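
Putting that together with the H01/H10 formula from the follow-up above, a minimal sketch (using the poster's variable names and assuming all probabilities are non-zero):

$first  = array();
$second = array();
foreach ($first_strings as $k => $s)  { $first[$s]  = $first_string_probabilities[$k]; }
foreach ($second_strings as $k => $s) { $second[$s] = $second_string_probabilities[$k]; }

$H01 = 0.0;
$H10 = 0.0;
// Only strings present in both profiles contribute.
foreach ($first as $s => $p1) {
    if (isset($second[$s])) {
        $p2 = $second[$s];
        $H01 += $p1 * log($p1 / $p2);
        $H10 += $p2 * log($p2 / $p1);
    }
}

That replaces the n^2 nested loop with a couple of linear passes.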
posted by lhl at 8:03 AM on May 13, 2006


before i check out, just want to mention that array_flip() will do key/value flipping automatically. 1 line instead of 2.
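
For the string => probability map specifically, array_combine() is the direct one-liner; array_flip() gives string => index (shown here only as a sketch):

$probability_hash = array_combine($dna_array, $probability_array);   // string => probability
$index_hash       = array_flip($dna_array);                          // string => original index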
posted by lhl at 8:25 AM on May 13, 2006


I'd definitely look at running it from the command line -- that way you can watch it work, so you know it's doing something.

Another option -- just run it on your local machine, particularly if the web server is a shared resource. Is there some reason not to? You could always connect to the shared DB to save the results when you're done.
posted by ph00dz at 8:35 AM on May 13, 2006


I had a similar snafu with a different kind of data. Most of it was cleared up by what people have said here (farm it out to the CLI, better functions, use a DB, etc.), but the tack I took may also help:

Rather than use the webserver to do the calculations - thus halting the web app - I took the 'shopping cart' approach and let people stack their queries into a queue. These are then farmed out to an array of boxen whose sole job is number crunching, leaving the webserver to handle the web app.

Overall, something like:
1) Web app places the query and dataset into a database as a pending job
2) Pending job is added to the user's 'queue'/session. Basic queue status and management functions are built into the web app
3) Jobs are farmed out round-robin style to a cluster of machines (rented, offsite, whatever) that save the results back to the DB. Results are then sent to the user.

It's a little more work, but it has the added bonus of letting you throw as many machines as you can get your hands on into the cluster.
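
A bare-bones sketch of step 1, assuming a MySQL "jobs" table (the table, columns, and variables are all invented for illustration):

// Web app: enqueue the comparison instead of running it inline.
$conn = mysql_connect('localhost', 'user', 'pass');
mysql_select_db('webapp', $conn);

$payload = mysql_real_escape_string(serialize(array(
    'first'  => $_SESSION['profiles'][$first_profile_index],
    'second' => $_SESSION['profiles'][$second_profile_index],
)), $conn);

mysql_query("INSERT INTO jobs (user_id, payload, status, created)
             VALUES ('$user_id', '$payload', 'pending', NOW())", $conn);

// A worker script (run via cron or the CLI on the crunch boxes) then polls
// for 'pending' rows, runs the heavy loop, and writes the result back.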
posted by bhance at 8:47 AM on May 13, 2006


If I'm understanding the problem, your current code is a hack that relies heavily on arbitrary array indexes to compensate for unfamiliarity with associative arrays. Go familiarize yourself with associative arrays and you can look up the DNA sequences by using the sequences themselves as your indexes, instead of some arbitrary index that has nothing to do with the problem you're trying to solve. This will remove all need for looping, drastically improve the speed, and have the added benefit of making sense to anyone else who needs to look at your code in the future.
posted by scottreynen at 9:08 AM on May 13, 2006


Best answer: set string2-array up as an associative array:

$s2probs[$name]=$p;

Then remove the inner loop and replace it with $s2prob = $s2probs[$string1];

Also try loading it all into a couple of MySQL memory tables and doing a join.
posted by cillit bang at 10:48 AM on May 13, 2006


Let me see if I have this right.

You have four arrays. Two of the arrays are for storing strings. Two more arrays for storing probabilities. And you're using the absolute indexes as a method of retrieving the probability data in order to compute a function that is then stored in an additional, fifth array? This is a wildly inefficient use of resources.

If you insist on passing 2^14 variables in session data (which I highly recommend against), you can easily get rid of two arrays off the bat by creating an associative array, as others have mentioned above. If you instead turned the data into two tables in a database, you could do a simple SELECT query that computes your log function over all matching values and returns them in one result set.

Here's how you'd do it:

LIST1, LIST2:
ID -> Primary Key, Int
sequence -> VARCHAR(50)
probability -> FLOAT

SELECT query:

SELECT LIST1.sequence AS sequence,
       LIST1.probability * LOG(LIST1.probability / LIST2.probability) AS probability
INTO LIST3
FROM LIST1, LIST2
WHERE LIST1.sequence = LIST2.sequence

That's it.
posted by Civil_Disobedient at 11:51 AM on May 13, 2006


FYI - The above statement would execute about a zillion times faster than any PHP loop, you won't have any memory problems, you won't cause your server to crash and burn, and you'll get a nice, tidy new table created that has all your results. No fuss, no muss.
posted by Civil_Disobedient at 11:53 AM on May 13, 2006

