# How to look up all rows that overlap a certain value?

October 7, 2004 11:25 AM

I am looking for an index data structure and query algorithm to solve a specific problem. Each row in the table represents a range, with a start/end value; I want to look up all the rows that overlap a certain value.

Putting it in SQL terms, I have a table with columns 'start', 'end', and 'data', and I want to run the following query:

select data from table where value >= start and value < end;

I'm working in C++, but don't need source code; I just can't remember the name or design of the algorithm/data structure that performs this kind of indexing.

Sounds like you're looking for a range query algorithm.

You can use a universal b-tree to solve this problem, but it's not the easiest thing to get your mind around (link)

Some additional google searching on "range query" and "universal b-tree" should give some more resources.

This is also a generic solution; depending on the scope of the problem you're trying to solve, you might be able to use a more specific solution that is easier to implement (e.g. a known limited upper and lower bound, a max number of items, etc.)

posted by freshgroundpepper at 12:04 PM on October 7, 2004

Ah, yes, hmm, I should probably describe the characteristics of this problem more fully.

The source data set is fixed per run; we index it once and then run lots of queries. Indexing time is approximately irrelevant. Values are contiguous, ascending from zero to some limit defined by the source data, usually in the tens to hundreds of thousands. Every key value is guaranteed to match at least one row. Memory is cheap but not free; a degenerate hash table with one bin per input value would take up too much space.

posted by Mars Saxman at 1:01 PM on October 7, 2004

It sounds like you might be talking about a quadtree (or some other similar recursive space-filling algorithm) especially since your application is essentially a spatial application in one-dimension (where you are trying to find the intersection of a 1-d region and a 0-d region)

posted by vacapinta at 1:36 PM on October 7, 2004
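[The 1-D recursive spatial structure hinted at here is usually called a centered interval tree: each node picks a center value, keeps the intervals that contain it, and recurses on the rest. A minimal C++ sketch, assuming half-open intervals with end > start; all names are illustrative, not from the thread:]

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Half-open interval [start, end), with end > start, carrying a payload value.
struct Interval { int start, end, data; };

// Centered interval tree node: intervals containing `center` stay here;
// intervals entirely left or right of it go into the subtrees.
struct Node {
    int center = 0;
    std::vector<Interval> crossing;       // start <= center < end
    std::unique_ptr<Node> left, right;    // strictly left / strictly right of center
};

std::unique_ptr<Node> build(std::vector<Interval> xs) {
    if (xs.empty()) return nullptr;
    auto node = std::make_unique<Node>();
    // Use the median start as the center so the tree stays roughly balanced.
    std::nth_element(xs.begin(), xs.begin() + xs.size() / 2, xs.end(),
                     [](const Interval& a, const Interval& b) { return a.start < b.start; });
    node->center = xs[xs.size() / 2].start;
    std::vector<Interval> lo, hi;
    for (const Interval& iv : xs) {
        if (iv.end <= node->center)       lo.push_back(iv);   // entirely left
        else if (iv.start > node->center) hi.push_back(iv);   // entirely right
        else                              node->crossing.push_back(iv);
    }
    node->left = build(std::move(lo));
    node->right = build(std::move(hi));
    return node;
}

// Append the data of every interval containing v to out.
void stab(const Node* n, int v, std::vector<int>& out) {
    if (!n) return;
    for (const Interval& iv : n->crossing)
        if (iv.start <= v && v < iv.end) out.push_back(iv.data);
    if (v < n->center)      stab(n->left.get(), v, out);
    else if (v > n->center) stab(n->right.get(), v, out);
    // v == center: every match is already in this node's crossing list
}
```

[Build once, query many times, which matches the stated workload. This sketch scans each visited node's crossing list linearly; keeping two sorted copies of each list (by start and by end) would give the textbook O(log n + k) per query.]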

Using SQL I would use

SELECT *
FROM myTable
WHERE certainValue BETWEEN range_start AND range_end;

posted by neilkod at 1:47 PM on October 7, 2004

Ahh, good thought, vacapinta - you're right, this is basically a 1D version of the standard point-in-rect hit test.

posted by Mars Saxman at 2:30 PM on October 7, 2004

is log(n) good enough, or do you want O(1)? for log(n) i *think* you could pre-construct a decision tree where alternate nodes by depth select upper and lower bounds (and i think that's very close to the quadtree idea). so top level, you have the question "is the upper bound less than U1?", where U1 is the median upper bound. if less, branch left. next node asks "is the *lower* bound greater than L2?", where L2 is the median lower bound for intervals whose upper bound is less than U1. etc etc. you could either try to construct it balanced, or construct it taking the points as given and then find some kind of invariance-preserving change that you can apply repeatedly until some balance condition is met (something like rotations in a red-black tree). constructing it balanced is probably easier but slower.

but there must be an easier way. what do you know about arithmetic coding (compression)? i wonder if that is similar. equivalently the range representation for trees in sql from celko. i can't put my finger on it, but it seems like they should be connected somehow.

i'm working on a paper for a related problem at the moment. if that solution applies to this, i'll let you know (and use it as an example!), but it may be a week or so before it's clear (my subconscious solves these problems, and it doesn't work on a fixed schedule).

interesting problem! you might ask an sql expert how the sql query is optimized...

posted by andrew cooke at 3:16 PM on October 7, 2004

(where "connected somehow" is defined as the relationship whereby, several weeks after you find the answer with sweat and tears, some smart-arse like vacapinta (i would use another more deserving target, but i'm at the wrong website) says "oh yes, of course - it's the categorical dual of [insert vague hunch here], so it's trivial" ;o)

posted by andrew cooke at 3:28 PM on October 7, 2004

do you really need all rows (or would one do)? what is the distribution of interval sizes compared to the full range?

posted by andrew cooke at 3:34 PM on October 7, 2004

Logarithmic time is OK for this application. Here are some values from one typical data set:

index range: 0 through 54246
table rows: 3555
min range: 0
max range: 747
mean range: 12.033

I do need all matching rows. I don't have actual measurements for this yet, but I'm expecting the typical query to return maybe three rows, rarely more than 6-7.

posted by Mars Saxman at 4:45 PM on October 7, 2004

since your intervals are so small relative to your range you can (probably/hopefully) get away with sorting by midpoint, and, for each row, indicating the maximum number of indices before or after it that you need to go to include all overlapping intervals. then bisection-search to the closest midpoint (log(n)) and check all rows within the indicated range. it's not asymptotically nice, but if those numbers are typical then it's simple, easy and reasonably efficient.

(you need the number of indices because otherwise there's no knowing if there's a huge interval that you need to consider with a relatively distant midpoint. of course, if such intervals do exist then this becomes inefficient. you can patch *that* for a small number of large intervals by handling them separately. ugly as sin, but might be practical.)

posted by andrew cooke at 5:01 PM on October 7, 2004

a small correction - you'd need to use the limits from the two rows with mid points either side of the value (rather than limits from the row with mid point closest to the value).

posted by andrew cooke at 5:13 PM on October 7, 2004
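[A close cousin of this sorted-scan idea sorts by start instead of midpoint and keeps a running maximum of end values, which removes the per-row reach bookkeeping entirely: the backward scan stops as soon as the running maximum falls below the query value. A sketch under those assumptions, with illustrative names:]

```cpp
#include <algorithm>
#include <vector>

struct Row { int start, end, data; };   // half-open [start, end)

// Rows sorted by start, plus a prefix maximum of end values.  A backward
// scan from the last start <= v stops once the prefix maximum shows that
// no earlier row can still reach v.
struct StabIndex {
    std::vector<Row> rows;       // sorted by start
    std::vector<int> max_end;    // max_end[i] = max end over rows[0..i]

    explicit StabIndex(std::vector<Row> r) : rows(std::move(r)) {
        std::sort(rows.begin(), rows.end(),
                  [](const Row& a, const Row& b) { return a.start < b.start; });
        int m = 0;
        for (const Row& row : rows) { m = std::max(m, row.end); max_end.push_back(m); }
    }

    // Return the data of all rows with start <= v < end.
    std::vector<int> query(int v) const {
        std::vector<int> out;
        // first row with start > v
        auto it = std::upper_bound(rows.begin(), rows.end(), v,
                                   [](int val, const Row& r) { return val < r.start; });
        for (auto i = it - rows.begin(); i-- > 0; ) {
            if (max_end[i] <= v) break;               // nothing earlier can cover v
            if (rows[i].end > v) out.push_back(rows[i].data);
        }
        return out;
    }
};
```

[The scan length is governed by how far a wide interval's start can trail the query value, so with the small mean interval width quoted above it should stay near the expected handful of rows per query.]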

If *all* of your intervals are relatively small, say at most X, then you can just sort by midpoint and extract all ranges within +-(X/2) of the midpoint of the range you're interested in, and then do a linear search through the results. If *most* of your intervals are small you could partition your data set accordingly.

In the general case, I think the data structure you're looking for is the R-tree, which is implemented in various databases (e.g. PostgreSQL and Oracle) and a lot of GIS software. Guttman's original paper doesn't seem to be online anywhere but you should be able to find it in a decent university library (ACM SIGMOD 1984 proceedings).

posted by hattifattener at 6:02 PM on October 7, 2004
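[The bounded-width idea above can be sketched directly: if no interval is wider than some known X, every row containing v must start in (v - X, v], so two binary searches on a start-sorted table bound the candidates. A minimal C++ sketch; the function name and parameters are mine, and the table is assumed pre-sorted at index time:]

```cpp
#include <algorithm>
#include <vector>

struct Row { int start, end, data; };  // half-open [start, end)

// If no interval is wider than max_width, any row containing v must start
// in (v - max_width, v].  Binary-search that window in a start-sorted table
// and filter the handful of candidates linearly.
std::vector<int> stab_bounded(const std::vector<Row>& sorted_rows,  // sorted by start
                              int max_width, int v) {
    auto lo = std::lower_bound(sorted_rows.begin(), sorted_rows.end(), v - max_width + 1,
                               [](const Row& r, int x) { return r.start < x; });
    auto hi = std::upper_bound(sorted_rows.begin(), sorted_rows.end(), v,
                               [](int x, const Row& r) { return x < r.start; });
    std::vector<int> out;
    for (auto it = lo; it != hi; ++it)
        if (v < it->end) out.push_back(it->data);   // start <= v already holds here
    return out;
}
```

[With the numbers quoted earlier (max range 747 out of 54246), the window is narrow; the few outsized intervals could be kept in a separate list and checked directly, as suggested above.]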

This thread is closed to new comments.

posted by Mars Saxman at 11:31 AM on October 7, 2004