Finding longest sequence in a set of numbers
September 2, 2006 12:01 AM

I have a question about finding the longest sequence in a set of numbers.

I have a database with a column containing numeric (integer) serial numbers. There are lots of gaps, and I'd like to find the longest sequence of sequential numbers in the set.

This possibly laughable bit of PHP code below is how I would approach it. (There is some error checking missing from it in the name of clarity as well as checking "isset()" and other niceties.)

// $query = returned database results
// $tempLow = scratch field for start of current sequence
// $low = final sequential set's start
// $high = final sequential set's end

$low = $query[0];
$high = $query[0];
$tempLow = $query[0];
for ($a=1; $a > count($query); $a++) {
` if ( $query[$a] > $query[$a - 1] + 1 ) {
`` if ( ( $high - $low ) < ( $query[$a - 1] - $tempLow ) ) {
``` $high = $query[$a - 1];
``` $low = $tempLow;
`` }
`` $tempLow = $query[$a];
` }
}

Is there a better way? This obviously only returns the first occurence of the longest sequence if there is more than one sequence of that length, which is OK.
posted by maxwelton to Computers & Internet (13 answers total) 1 user marked this as a favorite
So what's wrong with the code you have now? Premature optimization is the root of all evil.
posted by grouse at 2:14 AM on September 2, 2006


It feels more natural to me just to use another array.
Something like this?
$query        =  array(1, 2, 3, 5, 7, 8, 9, 11, 16, 17, 18, 19, 20, 22);$longestarray =  array();$temparray    =  array();for ($a = 0 ; $a < count($query) ; $a++) {  if ($query[$a] == ($query[$a - 1] + 1) || count($temparray) == 0) {    array_push($temparray, $query[$a]);  } else {    if (count($temparray) > count($longestarray)) {      $longestarray = $temparray;    }    $temparray = array();  }}print "longest sequence: ";print join(',',$longestarray)
That actually comes up with the longest sequence, which is what you asked about, although your code comes up wtih the location of the longest sequence.
posted by AmbroseChapel at 2:20 AM on September 2, 2006


AmbroseChapel: there is no way that is more efficient in time or memory. Array operations are expensive; scalar operations are cheap (unless PHP is more screwed up than I imagined). And personally I find it less clear to see what is going on.

A simple optimization you could add to the original algorithm is that after you find a new longest subsequence, let's refer to the value of $a there as $old_a. You then add ($high - $low)+1 rather than 1 to $a. Then check to see whether ($query[$a] - $query[$old_a+1]) > ($high - $low). If no, then you have another long subsequence, and you can keep proceeding one by one up to the next gap. If yes, then you don't have a long subsequence, so you should work your way BACKWARDS to the previous gap, then add ($high - $low) + 1 again.

This is somewhat like the Boyer-Moore string searching algorithm. You will save a few examinations and in the worst case it shouldn't be any worse than examining every cell of the query. Except, of course, for the fact that it might in PHP because of the way PHP is implemented. Which is a definitive possibility.

Another wrinkle you could add is to divide and conquer, incrementing by half the remaining sequence length in the places where you add by one in the previously described algorithms. If that doesn't work out, you can backtrack.

But really, why bother?
posted by grouse at 2:37 AM on September 2, 2006


>there is no way that is more efficient in time or memory.

Fair enough, I'm sure you're right. But we're dealing with arrays, so it feels more sensible to treat them as arrays, not just note their coordinates, as it were, within another array.
posted by AmbroseChapel at 2:48 AM on September 2, 2006


First off, this is bad:

for ($a = 0 ; $a < count($query) ; $a++) {br>
...and should be changed to this:

for ($a = 0, $b = count($query); $a < $b; $a++) {br>
Otherwise you're calculating the count every iteration. I'll take a longer look at it after I get some coffee in me to see what else can be done.
posted by Civil_Disobedient at 4:49 AM on September 2, 2006


Wierd. Ignore strange "br[bracket]" characters at the end of the code lines. Matt really needs to fix his character parser.
posted by Civil_Disobedient at 4:50 AM on September 2, 2006


Also, what do you mean by "the longest sequence of sequential numbers"? Do you mean that each number is greater than the previous number (e.g., 1, 3, 4, 7), or do you mean that each consecutive number follows an actual sequence (e.g., 1, 2, 3, 4)? I'm having trouble visualizing the dataset based on your words. Could you give an example so I can "see" what you're talking about?
posted by Civil_Disobedient at 4:57 AM on September 2, 2006


This can be done using SQL. I'm assuming that you're looking for the longest sequence of sequential integers. My example table is called PARTS and just has one column, representing serial numbers.

# select * from parts;
serial
--------
1
2
3
5
7
8
9
11
16
17
18
19
20
22

# -- Obtain the serial numbers involved in the longest sequential sequence (postgres) --
# select x2.serial from (select rank, serial, (serial-rank) as diff from (select T1.serial, (select count(distinct T2.serial) from parts T2 where T1.serial >= T2.serial) as rank from parts T1) as x order by diff) as x2 where x2.diff in (select diff from (select rank, serial, (serial-rank) as diff from (select T1.serial, (select count (distinct T2.serial) from parts T2 where T1.serial>=T2.serial) as rank from parts T1) as x order by diff) as x2 group by diff order by count(*) desc limit 1);
serial
--------
16
17
18
19
20

Here's how it works: given a list of nonrepeating integers, we observe that subtracting each value's rank from itself results in a constant number for continuous sequences. [NOTE: the above SELECT assumes that all of the serial numbers are unique. Several DISTINCT keywords would have to be added if this is not the case]. For this example, we have:

01, 02, 03, 05, 07, 08, 09, 11, 16, 17, 18, 19, 20, 22 serialNum
01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14 rank
00, 00, 00, 01, 02, 02, 02, 03, 07, 07, 07, 07, 07, 08 diff

Counting the distinct diffs, we get:
0, 1, 2, 3, 7, 8 diff
3, 1, 3, 1, 5, 1 count(diff)

Which indicates that the longest sequence is 5 serial numbers long, each having a difference of 7 from their ordered rank.

Is this faster than doing it in PHP? With the above example, I'd say it's about even and point out that often db engines are running on faster iron than webservers, and that they will optimize the above query pretty well.

On the other hand, that query wastes a *lot* of time calculating rank, in an inefficient manner. If you kept a serial number index for that table instead, the above query could be simplified a great deal, and it would become extremely fast to execute. You'd have to reindex or autoindex the table after adding/removing records, of course, and prohibit duplicate serial numbers.
posted by ceribus peribus at 6:52 AM on September 2, 2006


Actually, that query might already work for duplicate serial numbers. I just didn't want to test it.
posted by ceribus peribus at 7:06 AM on September 2, 2006


Depending on how often this query is called (maybe even concurrently?) and how large your table is, you might not be able to use the PHP approach - it might make you run out of memory because your app has to load the whole table for each request.
posted by uncle harold at 7:45 AM on September 2, 2006


Unless I misunderstand your question, this is just the Longest Increasing Subsequence problem and has been studied to death, is implemented in Mathematica, and for which two reasonably fast algorithms (n squared and nlogn time, where n is the total number of numbers in the original sequence) are explained on the page I linked to.
posted by louigi at 2:56 PM on September 2, 2006


louigi, it's close, but I only want the subsequences where the integer is sequential, like 1,2,3,4 not 1,2,5,6.

Thanks everyone for the discussion. Grouse, I guess there's nothing wrong with my original code, other than some thing in the back of my head always bugs me when I do stuff like this. I'm more properly a hack, rather than a programmer, as I'm self-taught and undoubtedly leave vast and efficient parts of PHP and SQL untapped.

ceribus, I was wondering if it would be better to do it in SQL. There are no duplicate serial numbers.

The table currently only has 11,000 rows and the realistic cap is going to be 100,000 rows.
posted by maxwelton at 4:24 PM on September 2, 2006


Oh, an example dataset:

13567
13890
13891
13894
14067
14069
14070
14071
14102

etc. The serial numbers cover a range of 300,000 numbers and the 11,000 thus far in the database are scattered among all the possibilities. It would not be shocking to discover the longest sequence of sequential (incremented by one) numbers was three.
posted by maxwelton at 4:32 PM on September 2, 2006


« Older Global Television Question   |   An afternoon in Zurich - suggestions? Newer »
This thread is closed to new comments.