Not quite a Voronoi diagram
July 8, 2022 3:07 AM   Subscribe

How can I split a group of geographic points into n regions containing equal numbers of points? (Using R)

So, my first thought was a Voronoi diagram, but it seems that takes a region and uses the points as seeds. What I want to do is take my great big list of lat/long coordinates and split them up into, say, 20 regions, each with the same number of points.

I'll be doing the calculations in R. I'd bet that if this process has a name I could find a way of doing it, but I'm not finding much on a basic search.
posted by Just this guy, y'know to Science & Nature (19 answers total) 4 users marked this as a favorite
 
Could you do a k-nearest neighbors approach?
posted by quadrilaterals at 4:29 AM on July 8, 2022 [1 favorite]


The usual thing is a clustering algorithm, but that's a different problem from yours. Clustering in the simplest case is done by finding the two points that are closest together and declaring them to be a cluster, then the next two closest points, and so on, adding to existing clusters and starting new ones.

You could start out that way, and proceed until you have 20 clusters, then only add to clusters until all points are assigned.

You would then need an additional step to even out the numbers of points. I use similar logic: pick a point in the largest cluster that's close to a smaller cluster and move it. Over and over.
posted by SemiSalt at 4:36 AM on July 8, 2022 [1 favorite]
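SemiSalt's rebalancing step ("pick a point in the largest cluster that's close to a smaller cluster and move it, over and over") can be sketched as follows. The thread is about R, but this is a Python/NumPy sketch; the rule of only moving into clusters at least two smaller than the largest is my own addition, there to guarantee the loop terminates (a later answer in this thread warns about points ping-ponging between two groups).

```python
import numpy as np

def rebalance(points, labels, k, max_moves=100_000):
    """Even out cluster sizes: repeatedly move the boundary point of the
    largest cluster that is closest to a much smaller cluster's centroid.
    Assumes every cluster is non-empty to start with."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels).copy()
    for _ in range(max_moves):
        sizes = np.bincount(labels, minlength=k)
        big = int(sizes.argmax())
        if sizes[big] - sizes.min() <= 1:
            break  # sizes differ by at most one: as even as possible
        centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        members = np.flatnonzero(labels == big)
        # distances from each member of the big cluster to every centroid
        d = np.linalg.norm(points[members, None, :] - centroids[None, :, :], axis=2)
        # only allow moves into clusters at least two smaller than the big one;
        # each such move strictly reduces the spread, so no ping-ponging
        d[:, sizes >= sizes[big] - 1] = np.inf
        i, c = np.unravel_index(np.argmin(d), d.shape)
        labels[members[i]] = c
    return labels
```

Each move shrinks the sum of squared cluster sizes, so the loop ends with all groups within one point of each other.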


You could probably use approaches that have their origins in building equal-population voting districts; your populations are just small.
posted by rockindata at 5:00 AM on July 8, 2022 [1 favorite]


Looks like geomander MIGHT get you in the right direction?
posted by rockindata at 5:04 AM on July 8, 2022 [1 favorite]


How can I split a group of geographic points into n regions containing equal number of points

In any number of ways. Your problem specification needs additional constraints.

For example, just sorting the entire list of points by latitude first and longitude second and then dividing it into N equal-sized groups will give you your N regions, but quite probably not the kind of shape you're thinking of.
posted by flabdablet at 6:56 AM on July 8, 2022 [6 favorites]
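flabdablet's sort-and-chop example is only a few lines. This is a Python/NumPy sketch rather than R, and `stripe_partition` is an illustrative name:

```python
import numpy as np

def stripe_partition(latlon, n):
    """Sort points by latitude, then longitude, and cut the sorted list
    into n nearly equal chunks. Trivially balanced group sizes, but the
    'regions' come out as latitude stripes, not compact blobs."""
    latlon = np.asarray(latlon, dtype=float)
    # lexsort: last key is primary, so latitude first, longitude second
    order = np.lexsort((latlon[:, 1], latlon[:, 0]))
    groups = np.empty(len(latlon), dtype=int)
    for g, chunk in enumerate(np.array_split(order, n)):
        groups[chunk] = g
    return groups
```

It satisfies the equal-count constraint exactly (to within one point), which is the answer's point: without more constraints, many degenerate partitions qualify.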


Also, once you do start putting on those additional constraints, you're going to run into issues with pathological lists that force you to prioritize between e.g. minimizing the total distances between the points within a group and maintaining equality of group sizes. So your spec needs to be absolutely clear about what you want these groups for and why.
posted by flabdablet at 7:02 AM on July 8, 2022 [3 favorites]


Response by poster: Very good points, thanks.

Ok, some additional context.
I have a large number of points (maybe 20,000?) that need to be visited. I have 20 people to do the visiting.
So this is essentially a variant of the Travelling Salesperson Problem.

The question is: where is the optimal location for each of my salespeople to start from?
It's a bit of a variation, because they'll be doing a bunch of points one day, then going home, then a bunch of points the next, and so on. So it's not like a continuous TSP tour.

But at the moment I'm just looking at the first part, the clustering.
posted by Just this guy, y'know at 8:10 AM on July 8, 2022


I tried to do this for a research study a few years back. It was difficult for me to figure out a computational answer. I ended up using QGIS, importing all the points (several hundred), and then manually creating groups of points near each other. I would often select too many or too few points, but could re-lasso an area excluding one or two. I was able to visually determine what a reasonable cluster would be in a way that I could not specify clearly enough for an algorithm. (That is, exactly what flabdablet said: I could computationally create groups, but they did not meet the geographic criteria I needed.)

This is not an elegant answer, nor is it replicable science, and I was frustrated to do the work by hand. But it worked for a one-off.
posted by quadrilaterals at 8:31 AM on July 8, 2022 [2 favorites]


I would use a clustering algorithm like k-means, which is well implemented in the clustR package (I think it's spelled that way?). Ask for 20 groups. Then take those 20 centroids as your 20 starting points. Program a “draft”, where you cycle through each centroid and pick the closest point. Repeat until the points are gone. Alternatively, look at the 20 clusters and reassign points from the big clusters to the little clusters, which might be faster.

Either way, a fun problem!
posted by Valancy Rachel at 8:36 AM on July 8, 2022 [1 favorite]
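Valancy Rachel's k-means-plus-draft idea, sketched in Python/NumPy (in R you could use the built-in `kmeans()` for the first stage; here a hand-rolled Lloyd's loop stands in for it, and the deterministic farthest-point initialization is my assumption to keep the sketch reproducible):

```python
import numpy as np

def kmeans_draft(points, k, iters=20):
    """k-means for k centroids, then a round-robin 'draft': clusters take
    turns claiming their nearest unassigned point, so cluster sizes end
    up equal (within 1 when n is not divisible by k)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # deterministic farthest-point initialization (no random seed needed)
    centroids = [pts[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(pts - c, axis=1) for c in centroids], axis=0)
        centroids.append(pts[d.argmax()])
    centroids = np.array(centroids)
    # Lloyd's iterations: assign to nearest centroid, recompute means
    for _ in range(iters):
        assign = np.linalg.norm(pts[:, None] - centroids[None], axis=2).argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = pts[assign == c].mean(axis=0)
    # the draft: cycle through centroids, each picks its nearest unassigned point
    labels = np.full(n, -1)
    unassigned = list(range(n))
    for turn in range(n):
        c = turn % k
        idx = np.array(unassigned)
        pick = idx[np.linalg.norm(pts[idx] - centroids[c], axis=1).argmin()]
        labels[pick] = c
        unassigned.remove(pick)
    return labels, centroids
```

The draft guarantees equal counts; the trade-off is that late picks can be far from their centroid, which is exactly the tension flabdablet raised.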


Could you use kmeans over an engineered distance metric? Step 1, warp the space so that all points are evenly spaced on a grid. Step 2, kmeans.

Maybe the "grid" can be one-dimensional, in which case all you need is a list of points in a spatially sensible order which you then break into k consecutive groups. You might start with an extreme point and then just recursively add the next nearest point.

I guess the problem would then be choosing a warping of the map that is conservative of whatever your objective is. It sounds like euclidean distance might still matter for your application, so you're going to be looking for the transformation into an equally spaced grid that minimizes some kind of difference between that and your new metric. Tbh this sounds like a problem for a neural network, because you're learning a smooth but not linear transform from one space to another subject to both hard and soft objectives. But I'm not sure I can specify it precisely. Like imagine that you train a network with the following objective: in the output space, minimize an appropriately weighted sum of (1) difference in kmeans cluster sizes and (2) square distances within each cluster relative to the original euclidean distance metric.
posted by grobstein at 9:23 AM on July 8, 2022
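grobstein's one-dimensional ordering idea ("start with an extreme point and then just recursively add the next nearest point") might look like the following Python/NumPy sketch. The greedy chain is O(n²), so it is workable but slow for 20,000 points; the function name is illustrative:

```python
import numpy as np

def chain_order_partition(points, k):
    """Greedy nearest-neighbour chain: start from an extreme (minimum-x)
    point, repeatedly hop to the nearest unvisited point, then cut the
    resulting 1-D ordering into k consecutive, nearly equal groups."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    start = int(np.lexsort((pts[:, 1], pts[:, 0]))[0])  # extreme point
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        idx = np.array(sorted(remaining))
        d = np.linalg.norm(pts[idx] - pts[order[-1]], axis=1)
        nxt = int(idx[d.argmin()])
        order.append(nxt)
        remaining.remove(nxt)
    labels = np.empty(n, dtype=int)
    for g, chunk in enumerate(np.array_split(np.array(order), k)):
        labels[chunk] = g
    return labels
```

This is the "spatially sensible order" version of the idea; a space-filling curve (e.g. Hilbert order) would be a faster drop-in replacement for the chain.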


You should ask this on an appropriate Stack* site though if you haven't!
posted by grobstein at 9:25 AM on July 8, 2022


Do you specifically want each partition to have the same number of points, or do you want each to involve the same amount of traveling for its 'salesperson'?
posted by away for regrooving at 9:35 AM on July 8, 2022 [1 favorite]


If these are actual people doing actual travel, your distance metric needs to be based on the modes of transport available to them, not on Euclidean distance.

Given that you only have 20 people to allocate clusters to, I would expect eyeballing this to be way faster than coding it, as well as involving only failure modes that are much less boneheadedly insane.
posted by flabdablet at 9:42 AM on July 8, 2022 [4 favorites]


A version of this question got asked on StackOverflow, originally posted 11 years ago but with answers and updates still coming in.
"I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce equally sized groups.

Is there a variation of this algorithm or a different one that allows for an equal count of members for all clusters?"
There are tons of different suggestions, some of which include example implementations or library functions that do something like this. I'm not 100% clear on your spec, so I'm not sure whether one of these fits, but they're worth looking through and maybe trying.
posted by grobstein at 4:17 PM on July 8, 2022 [2 favorites]
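One standard answer to that StackOverflow question turns equal-size assignment into a minimum-cost matching: replicate each centroid n/k times (one "slot" per point) and let the Hungarian algorithm pick the globally cheapest point-to-slot pairing. A sketch, assuming SciPy is available and that centroids have already been computed (e.g. by k-means):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def equal_size_assign(points, centroids):
    """Assign points to clusters of exactly equal size via min-cost
    matching on replicated centroid 'slots'. The solver is O(n^3), so
    for very large n you'd subsample or use a capacitated flow solver."""
    pts = np.asarray(points, dtype=float)
    cents = np.asarray(centroids, dtype=float)
    n, k = len(pts), len(cents)
    assert n % k == 0, "for an exactly equal split, n must be divisible by k"
    slots = np.repeat(np.arange(k), n // k)   # n slots, n/k per cluster
    # cost[i, j] = distance from point i to the centroid owning slot j
    cost = np.linalg.norm(pts[:, None] - cents[slots][None], axis=2)
    rows, cols = linear_sum_assignment(cost)
    labels = np.empty(n, dtype=int)
    labels[rows] = slots[cols]
    return labels
```

Unlike the greedy draft, this minimizes total point-to-centroid distance subject to the hard equal-size constraint.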


Do you want to include travel time? We could imagine these travelling salesmen working 8 hrs per day: the city salesmen visit 6 customers each with 10 minutes travel between customers, the country salesmen visit 1 customer per day with 7 hrs travel time, and the suburban... 4 customers a day.
posted by at at 10:22 PM on July 8, 2022 [1 favorite]


If you do end up doing some programming that moves points among clusters, watch out for endless loops when a point is moved back and forth between two groups. Other, more sophisticated traps may also await.
posted by SemiSalt at 5:54 AM on July 9, 2022 [1 favorite]


So are you trying to automatically cut turf for canvassing? If so, I would encourage you to see if you can find people with local knowledge to cut the turf instead. My experience with automatic turf cutting is…not great.
posted by rockindata at 7:44 PM on July 10, 2022 [1 favorite]


Again, you need to decide which is more important to you: equality of group sizes, or minimization of distances between points within any given group. Because there will always be cases that put these aims at odds.

Consider, for example, a list of 20000 points with five widely spaced cities containing 1500, 1700, 1900, 2100 and 2500 points, another eight widely spaced towns containing 800±100 points each, and the remaining ~4000 points sprinkled more or less evenly across the whole map.

I cannot see any reasonable way to reconcile the equal-sized-groups requirement with a sane clustering requirement in this case, and doubt the existence of an algorithm that could. I second the idea of delegating the clustering task to people with appropriate local knowledge if at all possible.
posted by flabdablet at 1:10 AM on July 11, 2022 [1 favorite]


Response by poster: I actually think it may be even more complicated than that, because I need to cluster by the anticipated number of points visited in a year. If the points are widely spaced, a region should contain fewer of them, because the travel time between visits is greater; if they are densely packed, it can contain more, because the time spent travelling is shorter.
posted by Just this guy, y'know at 8:01 AM on July 11, 2022

