# Unclump my histogram

August 17, 2010 5:03 AM

How to smooth a histogram or unclump a dataset.

I'm working on a 2D UI-type graph. The objects in the graph don't have to be plotted "realistically"; they just need to be stable, so it doesn't matter if I screw up the numbers as long as I do it predictably.

The reason I need to do any screwing at all is that the datapoints are very clumpy. I spent most of yesterday making my own function to unclump it and I'm not all that excited by the results. I'm wondering if there's a Known Method for this I should be using or if I just need to tweak my method.

Here's what I'm doing. First, in both axes I start with the data normalized to 0-1. Then for each axis I histogram it into N bins (for N in 10, 100, 360 so far). For each bin, I calculate a rescale value by dividing the count in the bin by the total sample size. Then I rescale each bin proportional to that amount.

The problem is, I'm still getting big unoccupied areas on the graph all the way across either axis. I *think* the problem is that the bins don't fall on the clump boundaries, so when I rescale I'm also scaling up some gaps. So maybe I should not assume constant bin sizes and instead have it "discover" the clumps and make bins around them? Or just increase the bin "resolution" by making them much smaller?
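
For reference, here is a minimal sketch of one reading of the steps above, assuming NumPy and data already normalized to 0-1. The function name `rescale_axis` and the sample values are hypothetical, not the actual code. Each fixed-width bin gets an output width equal to its share of the points, and values are mapped piecewise linearly between the old and new bin edges.

```python
import numpy as np

def rescale_axis(values, n_bins):
    """Piecewise-linear rescale: each bin's output width is its share of the points."""
    values = np.asarray(values, dtype=float)
    # Fixed-width bins over the normalized 0-1 range.
    counts, edges = np.histogram(values, bins=n_bins, range=(0.0, 1.0))
    share = counts / counts.sum()
    # New bin edges are the running total of the shares; empty bins collapse to zero width.
    new_edges = np.concatenate([[0.0], np.cumsum(share)])
    # Map each value linearly from its old bin onto its resized bin.
    return np.interp(values, edges, new_edges)

# Hypothetical data: a tight clump near 0 plus one far-away point.
points = np.array([0.01, 0.02, 0.03, 0.04, 0.9])
rescaled = rescale_axis(points, n_bins=2)
print(rescaled)
```

With only two bins, the clump sits in one corner of the first bin, so stretching that bin also stretches the empty space next to the clump. That is the gap-scaling effect described above.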

Response by poster: Why would small bin sizes cause a problem?

I can't get the histogram here, but the data source is non-geo satellites. Mean motions (which vary from about 2-16) and inclinations (which vary from about 0-150). Some of these regions are well-populated. The mmot in particular is very clumped at the high end, so I actually started with a log(17-mmot).

The source is probably quantized to...3 or 5 decimals? I normalized to 1, so I probably have...8? decimals of precision now. It doesn't look quantized zoomed in reasonably far on the graph (which I've seen before so I know what it looks like).

posted by DU at 5:19 AM on August 17, 2010

*Why would small bin sizes cause a problem?*

Large bin sizes in a histogram have a smoothing effect. Compare these two samples from the normal distribution.

If I use a bin size of 2, it's "clumpy", but if I use a bin size of 10, it's smoother.

posted by chrisamiller at 5:33 AM on August 17, 2010

To clarify, those are the same set of 200 values, just plotted with different bin sizes.
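
A rough sketch of that comparison, assuming NumPy. The seed, the bin counts (50 narrow bins versus 10 wide ones), and the helper `text_bars` are all made up for illustration and are not the settings behind the original plots.

```python
import numpy as np

# Hypothetical data: 200 draws from a standard normal, fixed seed for repeatability.
rng = np.random.default_rng(0)
samples = rng.normal(size=200)

# Same values, two bin sizes: 50 narrow bins versus 10 wide bins over the same range.
narrow_counts, narrow_edges = np.histogram(samples, bins=50)
wide_counts, wide_edges = np.histogram(samples, bins=10)

def text_bars(counts):
    """Render counts as rows of '#' so the shape is visible in a terminal."""
    return "\n".join("#" * int(c) for c in counts)

print(text_bars(narrow_counts))
print()
print(text_bars(wide_counts))
```

The narrow-bin version should look ragged and the wide-bin version closer to a bell shape, even though both are built from the same 200 numbers.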

posted by chrisamiller at 5:34 AM on August 17, 2010

Response by poster: Talking with a coworker, I just realized that I kind of want the opposite of a histogram. In a histogram, you get different heights for a constant width. What I'm trying to get is (more or less) constant heights for differing widths.

I should probably not have even brought up histograms for another reason too: The histograms I mentioned are internal for calculation purposes. The actual display graph is a 2D "dot" graph. I'm just trying to non-linearly (or piecewise linearly) scale both axes to alleviate the clumping. Histograms got dragged in when I was counting how many dots are in each section of the graph.

posted by DU at 5:43 AM on August 17, 2010

So what you want is to partition the dataset into variably-sized bins, all with the same # of data points in each?

Something like this seems easy enough: First, take the total number of data points, divide it by the number of bins you want to use and that gives you the number of points in each bin. Then sort the data points, step through them, and draw a dividing line every time you fill a bin.

There are cases where this won't work well. For example, let's say you have 100 data points, you want 10 windows, but in your list of values, the number 15 comes up 30 times. There's no way to subdivide that neatly without some questionable hacks where you divide data points with the same value into multiple bins.
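
The sort-and-step idea, and the way ties break it, could be sketched like this, assuming NumPy. The function name `equal_count_edges` and both test datasets are hypothetical.

```python
import numpy as np

def equal_count_edges(values, n_bins):
    """Bin edges chosen so each bin holds about len(values) // n_bins points."""
    ordered = np.sort(np.asarray(values, dtype=float))
    per_bin = len(ordered) // n_bins
    # Step through the sorted data and draw a dividing line each time a bin fills.
    cuts = [ordered[i * per_bin] for i in range(1, n_bins)]
    return np.array([ordered[0], *cuts, ordered[-1]])

# 100 distinct values: the dividing lines land every 10 points.
even_edges = equal_count_edges(np.arange(100), n_bins=10)

# 100 values where the number 15 comes up 30 times.
tied = np.concatenate([np.arange(15), np.full(30, 15), np.arange(16, 71)])
tied_edges = equal_count_edges(tied, n_bins=10)

print(even_edges)
print(tied_edges)
```

In the tied case several dividing lines land on the same value, 15, so some bins end up with zero width. That is the awkward case described above.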

posted by chrisamiller at 5:49 AM on August 17, 2010

Trying to talk through this is a bit awkward. If you could provide a screenshot or drawing of what you have and what you want, it would be immensely helpful. You know, picture worth a thousand words and all that.

posted by chrisamiller at 5:51 AM on August 17, 2010

Maybe a log transform (or some other kind of monotonic transform)?
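
As a small illustration, here is the log(17-mmot) transform mentioned earlier in the thread, assuming NumPy. The function name `spread_high_end` and the sample mean motions are made up.

```python
import numpy as np

def spread_high_end(mean_motion, ceiling=17.0):
    """Monotonic transform log(ceiling - x); requires every x < ceiling."""
    mean_motion = np.asarray(mean_motion, dtype=float)
    return np.log(ceiling - mean_motion)

# Hypothetical mean motions, sparse at the low end and clumped near 16.
mmot = np.array([2.0, 8.0, 15.0, 15.5, 16.0])
transformed = spread_high_end(mmot)

# Compare how much of the axis each gap between neighbours takes up.
gaps_before = np.diff(mmot)
gaps_after = np.abs(np.diff(transformed))
print(gaps_before / gaps_before.sum())
print(gaps_after / gaps_after.sum())
```

Because the transform is monotonic, the points keep their relative order (reversed here, since larger mean motions map to smaller outputs), and the crowded high end takes up a larger fraction of the axis.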

posted by Sutekh at 5:56 AM on August 17, 2010 [1 favorite]

Are you attempting Histogram Equalization? It's supposed to turn your histogram into a straight line.
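
A minimal one-axis sketch of the idea, assuming NumPy: replace each value with the fraction of points at or below it (the empirical CDF). The function name `equalize` and the sample data are hypothetical, and this is a simplification of what image-processing libraries do.

```python
import numpy as np

def equalize(values):
    """Map each value to the fraction of points that are <= it (empirical CDF)."""
    values = np.asarray(values, dtype=float)
    ordered = np.sort(values)
    # searchsorted with side="right" counts how many sorted points are <= each value.
    return np.searchsorted(ordered, values, side="right") / len(values)

# Hypothetical data: a tight clump near 0 plus one far-away point.
clumpy = np.array([0.0, 0.01, 0.02, 0.03, 1.0])
equalized = equalize(clumpy)
print(equalized)
```

The clumped points end up evenly spaced along the axis. Identical input values still map to the same output, so the result is stable from run to run.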

posted by Mike1024 at 5:58 AM on August 17, 2010

Response by poster: Sorry everyone, false alarm. I thought I was running into a theoretical difficulty when it was actually just a "bug" in the data. For some values, I have many data points with the exact same value. So they were falling on top of each other on the graph, making it look like some bins were underpopulated when in fact they were just standing on each other's toes. Counting *distinct* values per bin fixed that right up and now the graph looks much more reasonable.
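
In case it helps anyone else, counting distinct values per bin can be sketched like this, assuming NumPy and data normalized to 0-1. The names and sample values are illustrative, not the real code or data.

```python
import numpy as np

def counts_per_bin(values, n_bins, distinct=False):
    """Histogram counts over 0-1, optionally counting each distinct value only once."""
    values = np.asarray(values, dtype=float)
    if distinct:
        # np.unique drops repeats, so stacked dots count as a single dot.
        values = np.unique(values)
    counts, _ = np.histogram(values, bins=n_bins, range=(0.0, 1.0))
    return counts

# Hypothetical data: 50 points stacked on one value, plus 3 separate points.
stacked = np.array([0.05] * 50 + [0.55, 0.65, 0.75])
raw = counts_per_bin(stacked, n_bins=2)
unique_only = counts_per_bin(stacked, n_bins=2, distinct=True)
print(raw, unique_only)
```

The raw counts say the first bin is crowded, but on screen it holds a single visible dot, which is what the distinct count reflects.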

posted by DU at 6:12 AM on August 17, 2010

The magic words here are "kernel density." You can futz with the bandwidth parameter to get any degree of smoothing that you want.

Almost any statistical package will do a kernel density for a single variable. R and Stata have bolt-ons that claim to do kernel densities for the joint distribution of two variables, but I've never used them.
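
For a sense of what the bandwidth does, here is a hand-rolled one-variable Gaussian kernel density sketch using only NumPy. The function name `gaussian_kde_1d`, the sample values, and the two bandwidths are assumptions for illustration; a statistical package would normally pick the bandwidth for you.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance from every grid point to every sample, in units of the bandwidth.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return bumps.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical clumpy data and an evaluation grid wide enough to hold the tails.
data = np.array([0.10, 0.12, 0.50])
grid = np.linspace(-2.0, 3.0, 2001)
spiky = gaussian_kde_1d(data, grid, bandwidth=0.05)
smooth = gaussian_kde_1d(data, grid, bandwidth=0.3)
print(spiky.max(), smooth.max())
```

A small bandwidth keeps the clumps as sharp peaks, while a large one blends them into a single broad hump.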

posted by ROU_Xenophobe at 7:42 AM on August 17, 2010

Without a better idea of what you're trying to accomplish with the histogram/graph, it's hard to give you coherent advice on how to solve your problem.

Regarding "clumpy" histograms: you can try using the kernel density estimator that ROU_Xenophobe mentioned, or use an adaptive bin-size histogram.

posted by scalespace at 10:00 AM on August 17, 2010

An idea of the source of the data would be useful too. Has it been quantized somehow?

posted by Morsey at 5:09 AM on August 17, 2010