Comments on: Unclump my histogram
http://ask.metafilter.com/162478/Unclump-my-histogram/
Comments on Ask MetaFilter post Unclump my histogram
Tue, 17 Aug 2010 05:09:43 -0800
Question: Unclump my histogram
http://ask.metafilter.com/162478/Unclump-my-histogram
How to smooth a histogram or unclump a dataset. <br /><br /> I'm working on a 2D UI-type graph. The objects in the graph don't have to be plotted "realistically"; they just need to be stable, so it doesn't matter if I screw up the numbers as long as I do it predictably.<br>
<br>
The reason I need to do any screwing at all is that the datapoints are very clumpy. I spent most of yesterday making my own function to unclump it and I'm not all that excited by the results. I'm wondering if there's a Known Method for this I should be using or if I just need to tweak my method.<br>
<br>
Here's what I'm doing. First, in both axes I start with the data normalized to 0-1. Then for each axis I histogram it into N bins (for N in 10, 100, 360 so far). For each bin, I calculate a rescale value by dividing the count in the bin by the total sample size. Then I rescale each bin proportional to that amount.<br>
<br>
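A minimal sketch of one reading of the rescaling described above, assuming each equal-width input bin is given an output width equal to its share of the points (count divided by total). The names `rescale_axis` and `n_bins` are hypothetical, and this is not necessarily the exact function used here.

```python
def rescale_axis(values, n_bins=100):
    """Piecewise-linear rescale of values already normalized to [0, 1].

    Each equal-width input bin is stretched or squeezed so that its
    output width equals its fraction of the total sample.
    """
    total = len(values)
    counts = [0] * n_bins
    for v in values:
        # Clamp v == 1.0 into the last bin.
        i = min(int(v * n_bins), n_bins - 1)
        counts[i] += 1

    # starts[i] is the output position where input bin i begins.
    starts = [0.0]
    for c in counts:
        starts.append(starts[-1] + c / total)

    out = []
    for v in values:
        i = min(int(v * n_bins), n_bins - 1)
        frac = v * n_bins - i  # position inside the bin, 0..1
        out.append(starts[i] + frac * (counts[i] / total))
    return out


if __name__ == "__main__":
    clumpy = [0.0, 0.01, 0.02, 0.03, 1.0]
    print(rescale_axis(clumpy, n_bins=10))
```

Under this reading an empty bin gets zero output width, so any gap that survives on the graph has to sit inside an occupied bin.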
The problem is, I'm still getting big unoccupied areas on the graph all the way across either axis. I <i>think</i> the problem is that the bins don't fall on the clump boundaries so when I rescale I'm also scaling up some gaps. So maybe I should not assume constant bin sizes and instead have it "discover" the clumps and make bins around them? Or just increase the bin "resolution" by making them much smaller?
post:ask.metafilter.com,2010:site.162478
Tue, 17 Aug 2010 05:03:13 -0800
DU
histogram, clumps, math
By: Morsey
http://ask.metafilter.com/162478/Unclump-my-histogram#2333171
My initial reaction is that your bin sizes are too small rather than too large. Any chance of seeing one of your histograms? It would make this a great deal easier to understand. <br>
<br>
An idea of the source of the data would be useful too. Has it been quantized somehow?
comment:ask.metafilter.com,2010:site.162478-2333171
Tue, 17 Aug 2010 05:09:43 -0800
Morsey
By: DU
http://ask.metafilter.com/162478/Unclump-my-histogram#2333177
Why would small bin sizes cause a problem?<br>
<br>
I can't get the histogram here, but the data source is non-geo satellites. Mean motions (which vary from about 2-16) and inclinations (which vary from about 0-150). Some of these regions are well-populated. The mmot in particular is very clumped at the high end, so I actually started with a log(17-mmot).<br>
<br>
The source is probably quantized to...3 or 5 decimals? I normalized to 1, so I probably have...8? decimals of precision now. It doesn't look quantized zoomed in reasonably far on the graph (which I've seen before so I know what it looks like).
comment:ask.metafilter.com,2010:site.162478-2333177
Tue, 17 Aug 2010 05:19:46 -0800
DU
By: chrisamiller
http://ask.metafilter.com/162478/Unclump-my-histogram#2333186
<em>Why would small bin sizes cause a problem?</em><br>
<br>
Large bin sizes in a histogram have a smoothing effect. <a href="http://imgur.com/GvHlW">Compare these two samples from the normal distribution</a>.<br>
<br>
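A rough sketch of the same comparison in text form. The linked image's actual data is not available, so the distribution parameters (mean 50, standard deviation 15) and the seed are assumptions chosen only for illustration; `histogram` is a hypothetical helper.

```python
import random
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    return Counter(int(v // bin_width) for v in values)


random.seed(0)  # fixed seed so both plots use the same 200 values
sample = [random.gauss(50, 15) for _ in range(200)]

for width in (2, 10):
    counts = histogram(sample, width)
    print(f"bin width {width}")
    for k in sorted(counts):
        # One '#' per data point in the bin, labeled by the bin's lower edge.
        print(f"{k * width:6.0f} {'#' * counts[k]}")
```

The narrow bins should show a ragged outline, while the wide bins average over neighboring values and look closer to a bell shape.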
If I use a bin size of 2, it's "clumpy", but if I use a bin size of 10, it's smoother.
comment:ask.metafilter.com,2010:site.162478-2333186
Tue, 17 Aug 2010 05:33:31 -0800
chrisamiller
By: chrisamiller
http://ask.metafilter.com/162478/Unclump-my-histogram#2333187
To clarify, those are the same set of 200 values, just plotted with different bin sizes.
comment:ask.metafilter.com,2010:site.162478-2333187
Tue, 17 Aug 2010 05:34:02 -0800
chrisamiller
By: DU
http://ask.metafilter.com/162478/Unclump-my-histogram#2333191
Talking with a coworker, I just realized that I kind of want the opposite of a histogram. In a histogram, you get different heights for a constant width. What I'm trying to get is (more or less) constant heights for differing widths.<br>
<br>
I should probably not have even brought up histograms for another reason too: The histograms I mentioned are internal for calculation purposes. The actual display graph is a 2D "dot" graph. I'm just trying to non-linearly (or piecewise linearly) scale both axes to alleviate the clumping. Histograms got dragged in when I was counting how many dots are in each section of the graph.
comment:ask.metafilter.com,2010:site.162478-2333191
Tue, 17 Aug 2010 05:43:38 -0800
DU
By: chrisamiller
http://ask.metafilter.com/162478/Unclump-my-histogram#2333194
So what you want is to partition the dataset into variably-sized bins, each with the same number of data points? <br>
<br>
Something like this seems easy enough: First, take the total number of data points, divide it by the number of bins you want to use and that gives you the number of points in each bin. Then sort the data points, step through them, and draw a dividing line every time you fill a bin.<br>
<br>
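The steps above can be sketched roughly as follows. The function name `equal_count_edges` is hypothetical, and the sketch simply takes each dividing line at the first point of the next bin.

```python
def equal_count_edges(values, n_bins):
    """Return n_bins + 1 edges so each bin holds about len(values) / n_bins points."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[0]]
    for b in range(1, n_bins):
        # Index of the first point that belongs to bin b.
        cut = b * n // n_bins
        edges.append(ordered[cut])
    edges.append(ordered[-1])
    return edges


if __name__ == "__main__":
    print(equal_count_edges(range(100), 10))
```

When one value repeats often enough to fill more than one bin, this sketch returns the same edge more than once, so those bins have zero width.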
There are cases where this won't work well. For example, let's say you have 100 data points, you want 10 bins, but in your list of values, the number 15 comes up 30 times. There's no way to subdivide that neatly without some questionable hacks where you divide data points with the same value into multiple bins.
comment:ask.metafilter.com,2010:site.162478-2333194
Tue, 17 Aug 2010 05:49:47 -0800
chrisamiller
By: chrisamiller
http://ask.metafilter.com/162478/Unclump-my-histogram#2333195
Trying to talk through this is a bit awkward. If you could provide a screenshot or drawing of what you have and what you want, it would be immensely helpful. You know, picture worth a thousand words and all that.
comment:ask.metafilter.com,2010:site.162478-2333195
Tue, 17 Aug 2010 05:51:39 -0800
chrisamiller
By: Sutekh
http://ask.metafilter.com/162478/Unclump-my-histogram#2333199
Maybe a log transform (or some other kind of monotonic transform)?
comment:ask.metafilter.com,2010:site.162478-2333199
Tue, 17 Aug 2010 05:56:22 -0800
Sutekh
By: Mike1024
http://ask.metafilter.com/162478/Unclump-my-histogram#2333201
Are you attempting <a href="http://en.wikipedia.org/wiki/Histogram_equalization#Full-sized_image">Histogram Equalization</a>? It's supposed to flatten your histogram, so that the cumulative histogram becomes a straight line.
comment:ask.metafilter.com,2010:site.162478-2333201
Tue, 17 Aug 2010 05:58:05 -0800
Mike1024
By: DU
http://ask.metafilter.com/162478/Unclump-my-histogram#2333218
Sorry everyone, false alarm. I thought I was running into a theoretical difficulty when it was actually just a "bug" in the data. For some values, I have many data points with the exact same value. So they were falling on top of each other on the graph, making it look like some bins were underpopulated when in fact they were just standing on each other's toes. Counting *distinct* values per bin fixed that right up and now the graph looks much more reasonable.
comment:ask.metafilter.com,2010:site.162478-2333218
Tue, 17 Aug 2010 06:12:11 -0800
DU
By: ROU_Xenophobe
http://ask.metafilter.com/162478/Unclump-my-histogram#2333317
The magic words here are "kernel density." You can futz with the bandwidth parameter to get any degree of smoothing that you want.<br>
<br>
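A minimal sketch of a one-variable kernel density estimate, assuming a Gaussian kernel and a fixed bandwidth. The function `gaussian_kde` here is a hypothetical pure-Python helper written for illustration, not the routine from any particular statistical package, and the sample data is made up.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


if __name__ == "__main__":
    data = [0.10, 0.11, 0.12, 0.88, 0.90, 0.91]  # two made-up clumps
    for bw in (0.02, 0.3):
        kde = gaussian_kde(data, bw)
        print(bw, [round(kde(x / 10), 3) for x in range(11)])
```

A small bandwidth keeps the two clumps as separate spikes with almost nothing between them, while a large one blurs them into a single broad curve.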
Almost any statistical package will do a kernel density for a single variable. R and Stata have bolt-ons that claim to do kernel densities for the joint distribution of two variables, but I've never used them.
comment:ask.metafilter.com,2010:site.162478-2333317
Tue, 17 Aug 2010 07:42:41 -0800
ROU_Xenophobe
By: scalespace
http://ask.metafilter.com/162478/Unclump-my-histogram#2333512
Without a better idea of what you're trying to accomplish with the histogram/graph, it's hard to give you coherent advice on how to solve your problem. <br>
<br>
Regarding "clumpy" histograms: you can try using the <a href="http://en.wikipedia.org/wiki/Kernel_density_estimation">kernel density estimator</a> that ROU_Xenophobe mentioned, or use an adaptive bin-size histogram.
comment:ask.metafilter.com,2010:site.162478-2333512
Tue, 17 Aug 2010 10:00:09 -0800
scalespace