Zip Code Radius Searches That Aren't Painful
August 28, 2006 2:11 PM   Subscribe

What is the best strategy to limit using processing resources in doing a zip code radius query? (explained inside)

Imagine a scenario where you had a number of MySQL database records with information, including a zip code, and a cooresponding latitude and longitude associated with that zipcode.

Imagine now that you wanted to allow someone to see all records that fall within a certain radius of their own zipcode. In other words, I input that I am located in 60661, and a PHP program determines which of the MySQL records are within 10 miles of 60661.

I have seen lots of PHP/MySQL radius scripts, and have a fully accurate and complete list of zip codes with latitudes and longitudes. So my question isn't about trying to find them.

My question is: What is the best strategy to limit using processing resources in doing a query like this? Is it to predetermine what zipcodes fall within X miles, then get all records with those zipcodes? Is it to simply compare distances for every record, selecting only those that are within X miles? Is it to somehow do the latitude/longitude trig inside a MySQL query?

There seem to be a lot of people marketing "solutions" for problems like these, but the problem can't be that complicated, and since the trig functions to determine the distance between two zip codes are free and prevalant, I'd rather not pay for a half. Any thoughts are appreciated.
posted by JakeWalker to Computers & Internet (12 answers total) 2 users marked this as a favorite
"Is it to predetermine..."?

Yeah. Create a table containing zipcode-to-zipcode distances. Feel free to limit the number of records to those zipcodes within, say, 100 miles.

Then, select * from zipcode where zipcode_to_zipcode.distance < querydistance and zipcode_to_zipcode.first_zipcode=queryZipCode andbr> zipcode.zipcode = zipcode_to_zipcode.second_zipcode

I'm sure you're aware that this is a very error-prone search, since zipcodes are irregularly shaped and of different sizes.
posted by lobrien at 2:39 PM on August 28, 2006 [1 favorite]

Well, if performance is going to be critical, you might consider static pages for each ZIP code. I don't know how many ZIP codes are active, but consider:

- 99,999 possible ZIP codes
- Say, 100 nearby codes for each ZIP
- You save an HTML fragment that uses 20 bytes per zip (maybe ul tags and the number itself.)

If you precalculate each ZIP code, you'll use 2 kbytes per file, for a total of 199998 kbytes = about 200 MB. You'll want to split them into many directories for best performance (60661 would be in 6/60661.html, or 6/0/60661.html). Just recreate the files when your data changes, and your individual transactions are wicked fast.
posted by pocams at 2:40 PM on August 28, 2006

You already have this data, but for others that may be thinking about this problem: there are roughly 76,000 active ZIP codes.
posted by pocams at 2:42 PM on August 28, 2006

You can do a screening pass using a square range rather than a circle. This should be doable in pure SQL with comparisons rather than having to do a calculation for every zip. Then you can do the proper trig calc on the limited results.
posted by smackfu at 2:51 PM on August 28, 2006

You can do a screening pass using a square range rather than a circle.

I'll echo this suggestion -- just recently did exactly this. We looked at calculating the distance for each zip code pair, but that does take up a big amount of space and the site was going to go on shared hosting where space was going to be tighter than the requirements, so.... to grab potential candidates from the database, we did something like this:

SELECT * from zipcodes WHERE latitude > $low_lat AND latitude < $hil_at AND longitude > $low_lon AND longitude < $hi_lon ORDER BY zip ASC

And then we computed the Haversine distances for this set to get the final set inside the actual circular radius.
posted by weston at 4:14 PM on August 28, 2006 [1 favorite]

Re. the size of the lookup table: while there are 76K active zipcodes, the lookup table row count need only be the square of the count of unique zips in the _existing_ database. Having said that, pruning the rows so that the search space is smaller is a fine idea (e.g., only compute distances between zips in the same or adjoining states). The dynamic calculation against a square range is a good idea, but it's performance ought to be measured against the lookup table.
posted by lobrien at 4:34 PM on August 28, 2006

the lookup table row count need only be the square of the count of unique zips in the _existing_ database

If n is the number of unique zips, you can actually get away with ∑ n, rather than the square of n, if you impose a canonical order on the zip code pairs. This cuts the number of pairs you need by about half. However, it can still get to be a big number quickly.
posted by weston at 4:52 PM on August 28, 2006

Another way to handle the "n squared" problem with the lookup table is to do this instead:

1) Determine the zips within radius.
2) Stick those within-radius zips in a temp table or a scratch table. Make sure it's indexed, and the zip column to your main data table indexed.
3) Join the within-radius zips to your main data table.

I think this is a faster solution. Databases are set up specifically to make joins fast. So if you're not just doing a straight select, and you can translate your solution into one that uses joins, it's usually faster.

Note, however, that you may lose a lot of performance to sorting in the bargin. Sorting is 5x-10x slower than a straight join on medium sized result sets -- especially ones that require a temp table to complete. Your query will be dominated by sort time even with a quick select.

Unless your data is already sorted "naturally" by an index in the correct order.

And that depends on exactly which database you use, exactly which type of tables you use, exactly which indexes you have set up, and exactly what your select looks like. Sometimes indexes make all the difference in the world, and sometimes they're useless. It just depends on whether the series of joins can be done while still respecting the index sort orders.

So try removing the sort to see how much faster it gets. Try the join versus the select box. it's important to set up acid tests of all these ideas, really read up on your index options, and see what works and what doesn't for yourself.
posted by maschnitz at 6:31 PM on August 28, 2006 [1 favorite]

(By the way, you really should have your zip codes as ints if you're going to do a join like that. That'll make the join almost 2x faster.)
posted by maschnitz at 6:33 PM on August 28, 2006

I know you asked about MySQL, but i thought i'd point out that postgreSQL has geometric data types and functions which can be used to ask these questions directly (though i guess you'd need to treat longitude/latitude as points on a 2D plane, which i suppose they aren't for large enough areas).

Offloading this kind of massive data sifting work to the database layer is what RDBMSes are all about!
posted by dkg at 7:10 PM on August 28, 2006 [1 favorite]

One hitch with going with PostgreSQL geometric data types: you'd have to use an ellipse, not a circle.

Using a geometric point assumes a cylindrical projection of the Earth's surface. (It's sorta like a Mercator projection, but different.) The cylindrical projection turns the sphere's search circles ellipsoid as you head towards the North and South Poles. The effect is noticeable in the northern part of the mainland US, and very noticeable in Barrow, Alaska.

There are database packages and addons that use geospatial datatypes (as opposed to geometric datatypes). They have the advantage over a lat/lon pair of much more complicated, and efficient, 2D indexing schemes. The circular Haversine-distance radius also are built-in, an advantage over geometric points. But they tend to be vertical market or fringe products.
posted by maschnitz at 1:17 AM on August 29, 2006

The answer is NOT TO USE SQL! There are data structures and algorithms designed to cluster items by locality in however many dimensions you care about; 2 in this case.

I suggest you look up Multidimensional Access Methods, R-Trees, SR-Trees, X-Trees, TV-Trees and KD-Trees.
posted by polyglot at 2:53 AM on August 29, 2006

« Older Scared of the future   |   Force javascript to open a new window? Newer »
This thread is closed to new comments.