# Scaling issue calculating similarity between paired numbers

May 24, 2013 9:32 AM Subscribe

I have a list of paired numbers that span multiple orders of magnitude, and I need to find a method to a) compare within each pair in a way that does not disproportionately bias the comparison at the high or low end of the list, and b) define which pairs are dissimilar enough to be excluded from further analysis. The dataset itself follows a rough sigmoid curve, with a few pairs in the 1000s, more in the 100s, a lot in the upper 10's, some in the low 10's, and a few in the single digits. I have tried a few different comparison methods so far, including percent difference and relative percent difference of both the raw and log-transformed data.

The data come from utility bills, two years prior and two years after a conservation measure was implemented. For a given account (there are ~7000), I have a usage for four years (two before and two after). I am not concerned about how to compare the before and after (I don't need to do stats on that). I want to make sure that any usage that is very dissimilar* within each "before" and "after" period for each account gets flagged so I am making a fair comparison between those two periods.

So, the data look like this:

Account 1 BeforeReading1 BeforeReading2 AfterReading1 AfterReading2

Account 2 BeforeReading1 BeforeReading2 AfterReading1 AfterReading2

...up to Account7000

*Very dissimilar, according to my partners at the utility, is more than a 3-fold increase or decrease for a typical user (usage in the 50-75 unit range). This multiplier doesn't scale well across the orders of magnitude, and I would love to have a way to lean on some sort of mathematical standard like 95% confidence interval but I am not sure what that would be based upon.

The data come from utility bills, two years prior and two years after a conservation measure was implemented. For a given account (there are ~7000), I have a usage for four years (two before and two after). I am not concerned about how to compare the before and after (I don't need to do stats on that). I want to make sure that any usage that is very dissimilar* within each "before" and "after" period for each account gets flagged so I am making a fair comparison between those two periods.

So, the data look like this:

Account 1 BeforeReading1 BeforeReading2 AfterReading1 AfterReading2

Account 2 BeforeReading1 BeforeReading2 AfterReading1 AfterReading2

...up to Account7000

*Very dissimilar, according to my partners at the utility, is more than a 3-fold increase or decrease for a typical user (usage in the 50-75 unit range). This multiplier doesn't scale well across the orders of magnitude, and I would love to have a way to lean on some sort of mathematical standard like 95% confidence interval but I am not sure what that would be based upon.

You need more information from your utility partners about the definition of "Dissimilar" that is the heart of your problem. I suspect you will not find any general methods which will do what you want without getting more input on exactly what you want to find...

The binning method suggested by midmarch snowman might work but may still indicate outliers in a bin which are still fine by the "dissimilar" standard or you may have a bin containing all "dissimilar" measurements which don't have any outliers.

posted by NoDef at 9:59 AM on May 24, 2013

The binning method suggested by midmarch snowman might work but may still indicate outliers in a bin which are still fine by the "dissimilar" standard or you may have a bin containing all "dissimilar" measurements which don't have any outliers.

posted by NoDef at 9:59 AM on May 24, 2013

When you say you want to flag usage that is dissimilar within each before and after period, do you mean, you want to flag cases where before1 is very far from before2 and/or after1 is very dissimilar from after2?

So, ideally, you'd like to know what is the distribution of |before1 - before2| and of |after1 - after2| across the dataset, and then you'd like to flag cases where one or both of these differences is much larger than normal? And this is tricky because the raw measurements are at vastly different scales.

If I am understanding the problem correctly, you could do something like |before1 - before2|/before1 or |before1 - before2|/(before1 + before2) --that is, how large is the difference scaled for the size of the raw measurements, basically a percent change. Then you could calculate statistics on this dataset of percent changes, and look for outliers. One question to consider is whether you expect this dataset to be normally (vs, say logarithmically or exponentially) distributed. To get a feel for that, you could plot a histogram of the percent changes and model accordingly.

Is it the case that most before1, before2 measures are roughly on the same order of magnitude, or are there some really huge swings?

posted by pompelmo at 11:30 AM on May 24, 2013

So, ideally, you'd like to know what is the distribution of |before1 - before2| and of |after1 - after2| across the dataset, and then you'd like to flag cases where one or both of these differences is much larger than normal? And this is tricky because the raw measurements are at vastly different scales.

If I am understanding the problem correctly, you could do something like |before1 - before2|/before1 or |before1 - before2|/(before1 + before2) --that is, how large is the difference scaled for the size of the raw measurements, basically a percent change. Then you could calculate statistics on this dataset of percent changes, and look for outliers. One question to consider is whether you expect this dataset to be normally (vs, say logarithmically or exponentially) distributed. To get a feel for that, you could plot a histogram of the percent changes and model accordingly.

Is it the case that most before1, before2 measures are roughly on the same order of magnitude, or are there some really huge swings?

posted by pompelmo at 11:30 AM on May 24, 2013

Why doesn't the multiplier scale well across the orders of magnitude? Perhaps the criteria is "greater than 3-fold difference" for some ranges of the data, but some other criteria for other (larger) ranges? If the criteria is the same for all ranges, it scales fine.

posted by at at 9:13 AM on May 26, 2013

posted by at at 9:13 AM on May 26, 2013

Response by poster: Thank you all for your answers. I ended up doing a combination of things to work with the dataset.

1) I spoke to my utility partners again. Since the scaling issue was more of a problem at the low end (i.e. an increase from say 10 units to 40 units would not be a big deal in real life but would seem like one just based on the 4x increase), I gathered some additional information that allowed me to cull out abnormally low use (i.e. even lower than the most conserving users) and fixed some of those issues.

2) I compared the maximum and minimum values within "before" and "after" (dividing max by min) and flagged data that exceeded the 3x criteria.

After that, I felt comfortable making generalized before/after comparisons as well as year to year comparisons. The good news is that the data showed what I had predicted - the conservation measure had a greater impact on the higher users.

posted by nekton at 11:18 AM on May 31, 2013

1) I spoke to my utility partners again. Since the scaling issue was more of a problem at the low end (i.e. an increase from say 10 units to 40 units would not be a big deal in real life but would seem like one just based on the 4x increase), I gathered some additional information that allowed me to cull out abnormally low use (i.e. even lower than the most conserving users) and fixed some of those issues.

2) I compared the maximum and minimum values within "before" and "after" (dividing max by min) and flagged data that exceeded the 3x criteria.

After that, I felt comfortable making generalized before/after comparisons as well as year to year comparisons. The good news is that the data showed what I had predicted - the conservation measure had a greater impact on the higher users.

posted by nekton at 11:18 AM on May 31, 2013

« Older What was academia like before different IT... | private loan (<$100K) for an international... Newer »

This thread is closed to new comments.

I'm pretty sure its possible to do all that with excel, though I've personally never done standard deviation calculations in excel.

posted by midmarch snowman at 9:49 AM on May 24, 2013