# I thought this new scoring system would take care of all the other issues...
December 3, 2012 8:03 AM

I have a bunch of scores for sites that are the sum of the individual scores of the samples that they contain. The number of samples in each site varies from 1 to several hundred. I would like to adjust the overall site scores to account for the varying number of samples, so that a site with 200 samples doesn't overwhelm a site with 10 that may be just as significant. However, I'm at a complete loss as to how to accomplish this. Any thoughts?
posted by buttercup to Science & Nature (11 answers total)

Divide the sum of the scores for a site by the number of samples for that site, i.e. for a site with 5 scores:
(50 + 75 + 100 + 25 + 0) / 5 = 50
and a site with 2:
(50 + 50) / 2 = 50
posted by straw at 8:10 AM on December 3, 2012

I'm possibly misunderstanding something here, but to expand a bit on straw's response, why not just determine the mean (what straw said) and median (to compensate for outliers that can mess up a mean) of each site's scores?

(Incidentally, MS Excel can calculate the mean and median for any given data automatically using the =AVERAGE and =MEDIAN functions.)
posted by Wretch729 at 8:20 AM on December 3, 2012
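
The mean and median suggested above can also be compared outside a spreadsheet. This is a minimal sketch using Python's standard `statistics` module on straw's two example sites; `site_c` is a made-up site, added only to show how a single outlier pulls the mean away from the median.

```python
from statistics import mean, median

# Per-site sample scores. site_a and site_b are straw's examples;
# site_c is hypothetical and contains one large outlier.
sites = {
    "site_a": [50, 75, 100, 25, 0],
    "site_b": [50, 50],
    "site_c": [10, 10, 10, 10, 400],
}

# Summarise every site by both statistics so they can be compared side by side.
summary = {
    name: {"mean": mean(scores), "median": median(scores)}
    for name, scores in sites.items()
}

for name, stats in summary.items():
    print(name, stats)
```

For `site_c` the mean is 88 while the median stays at 10, which is the outlier effect the median is meant to dampen.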

Sorry, posted too soon. I meant to say I've tried a number of the standard approaches and I haven't found anything that works the way I'm looking for. The average method seems to downgrade the important sites too much. I've tried a couple of other methods, including taking the SQRT of the whole score as well as just dividing the final score by the SQRT of the number of samples, but haven't had anything that seems to work.
posted by buttercup at 8:23 AM on December 3, 2012

So... you want to grade sites on a curve? OK I'm being flippant but if comparing median scores across sites isn't working I don't know how to answer this without more info about what you're trying to measure. Are the scores a measure of quality or of quantity? In other words is the point of the scores to show how well a site is doing or how much it has done? Would a site with 20 samples of high quality in theory get a higher score than a site with 200 low quality samples? What are you trying to quantify?
posted by Wretch729 at 8:31 AM on December 3, 2012

This is an article that tackles the problem of how to sort a list by its elements' ratings. It gives examples (such as Urban Dictionary definitions and Amazon search results) and explains how their systems can give flawed results, or be skewed when some elements have many ratings and others do not.

The author restates the problem as: "Given the ratings I have, there is a 95% chance that the 'real' fraction of positive ratings is at least what?" and then sorts by the lower bound of that confidence interval.
posted by mce at 8:53 AM on December 3, 2012
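
One common way to answer that "at least what?" question is the lower bound of the Wilson score interval for a proportion. The sketch below assumes that interval (the thread does not name the method) and assumes each item has a count of positive ratings out of a total, which is not how buttercup's scores are structured, so it illustrates the idea rather than a drop-in fix. The function name and the default `z = 1.96` (a two-sided 95% interval) are choices made for this example.

```python
from math import sqrt


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion.

    positive: number of positive ratings
    total:    number of ratings overall
    z:        normal quantile; 1.96 corresponds to a two-sided 95% interval
    """
    if total == 0:
        return 0.0  # no evidence at all, so rank at the bottom
    p = positive / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * sqrt((p * (1 - p) + z2 / (4 * total)) / total)
    return (centre - margin) / (1 + z2 / total)


# A single perfect rating ranks below nine positives out of ten,
# because one rating is weak evidence.
print(wilson_lower_bound(1, 1))
print(wilson_lower_bound(9, 10))
```

Sorting by this bound rewards items with both a high fraction of positive ratings and enough ratings to trust that fraction.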

In a way, yes, I would like to grade the sites on a curve. The data in question relates to rare species and the sites are defined landscape units (e.g. a wetland). Scores for the rare species are based on global rarity and a measure of quality. So a species that is globally rare (may only exist in a few places on earth) and is of high quality (e.g. large population size) would get a higher score than a species that is not all that rare (it's in 100 sites) but of the same quality.

I've taken those individual species scores and summed them for each individual site. So there are between 1 and 300 species (i.e. "samples") per site. The site that has 300 species at it, which are not globally rare, but still significant, is at the top of the list with a site score of ~6000. A site that has the largest known population of a species that only exists in a few places in the world (plus 6 other species) has a score of ~450. A site that has twenty species, all significant but less globally so, is ending up scoring higher than the one with a score of 450 mentioned above. I'm looking to figure out a way to adjust for the number of samples in each site so it's a little more balanced.
posted by buttercup at 8:55 AM on December 3, 2012
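
The plain sum, the "divide by SQRT of the number of samples" attempt, and the average are all the same formula with a different exponent: `total / n ** alpha` with `alpha` equal to 0, 0.5 and 1. The sketch below treats `alpha` as a tunable knob (an assumption for illustration, not something proposed in the thread) and applies it to the two approximate site totals quoted above.

```python
def damped_score(total, n_samples, alpha):
    """Divide a summed site score by n_samples ** alpha.

    alpha = 0   -> plain sum
    alpha = 0.5 -> sum divided by the square root of the sample count
    alpha = 1   -> plain average
    """
    return total / n_samples ** alpha


# (summed score, number of species), using the approximate figures from the post.
sites = {
    "large_site": (6000, 300),
    "rare_site": (450, 7),
}

for alpha in (0.0, 0.5, 0.75, 1.0):
    ranked = sorted(
        sites,
        key=lambda name: damped_score(*sites[name], alpha),
        reverse=True,
    )
    print(alpha, ranked)
```

With these two totals the large site still ranks first at `alpha = 0.5`, and the rare-species site overtakes it somewhere between 0.5 and 0.75. Choosing the exponent remains a subjective call, but it makes the trade-off explicit.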

Unfortunately I don't have a good easy answer, as determining how to weight your variables seems like it's inevitably going to be pretty subjective. Is there some absolute scale of rarity you could use to quantify the rareness of a given species and use that to weight your data? Like some ranking from the WWF or the UN or something? I think the US Endangered Species Act has three main categories (endangered, threatened, candidate) and several other minor ones (experimental populations, endangered due to similarity with another species, etc.). Could you use something like that and try assigning various weights to each ranking? It's admittedly crude but might be useful.
posted by Wretch729 at 9:22 AM on December 3, 2012

Hmmm, one quick way to 'promote' sites with rare species might be to square the global rarity value when calculating site scores and use the log of the population sample size.
posted by mce at 10:02 AM on December 3, 2012

Boilerplate: As others have pointed out, there's some level of subjectivity inherent in this.

Paraphrasing to see if I'm understanding this correctly: You basically want to rank sites based on something like damage to diversity if removed, ja?

I might try something like the following:

Consider the geometric mean of the sum of the scores across all species, across all sites. Call this GT. I recommend the geometric mean because it essentially treats a percentage change in any given score equally. Since you're basically weighting your samples by a global rarity factor, this means that if species A has twice the rarity score of species B, a 10% increase (decrease) in A is the same as a 20% increase (decrease) in B. You can use whatever other aggregate score you want, obviously.

Now, for each site S, compute GS, the geometric mean with the scores from site S set to 0. Then GT - GS should be something like the amount of diversity added by site S.

Alternatively: Take logs and squares and sqrts of things until you get something that feels right and then find a post hoc justification. (I'm in the private sector now, so I can suggest that without blushing.)
posted by PMdixon at 10:37 AM on December 3, 2012
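
PMdixon's GT - GS idea can be sketched as follows, under one possible reading of the post: sum each species' score across all sites, take the geometric mean of those per-species totals to get GT, then recompute with one site's scores zeroed out to get GS. The site and species names, the scores, and the helper function names are all hypothetical.

```python
from math import prod


def geometric_mean(values):
    """Geometric mean of non-negative numbers (zero if any value is zero)."""
    values = list(values)
    return prod(values) ** (1 / len(values))


def species_totals(site_scores, skip_site=None):
    """Sum each species' score across sites, treating skip_site's scores as 0."""
    species = {sp for scores in site_scores.values() for sp in scores}
    totals = dict.fromkeys(species, 0.0)
    for site, scores in site_scores.items():
        if site == skip_site:
            continue
        for sp, score in scores.items():
            totals[sp] += score
    return totals


def site_contribution(site_scores, site):
    """GT - GS: the drop in the overall geometric mean when a site is removed."""
    gt = geometric_mean(species_totals(site_scores).values())
    gs = geometric_mean(species_totals(site_scores, skip_site=site).values())
    return gt - gs


# Hypothetical data: species C occurs only at site_3.
site_scores = {
    "site_1": {"A": 15, "B": 20},
    "site_2": {"A": 15, "B": 20},
    "site_3": {"C": 50},
}

for site in site_scores:
    print(site, round(site_contribution(site_scores, site), 2))
```

Under this reading, a site that holds the only occurrence of any species drives GS to zero and so receives the full GT as its contribution. That strongly favours sites with unique species, which may or may not be the behaviour wanted.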

To me it sounds like you're trying to re-invent a species / diversity / ecological index. Is there any particular reason you're not using, or haven't examined the method and maths behind, any of the existing ones e.g. SIGNAL, Simpson's, etc?
posted by Pinback at 12:08 PM on December 3, 2012

Information Retrieval to the rescue! You need tf*idf, term frequency multiplied by inverse document frequency. It will let you score less frequent species higher.

In your case, term frequency = species' score, and document = site. So, within each site, take the score for each species, and multiply it by log of (number of sites)/(number of sites in which the species appears). Then add these scores for all species within site to get the site score.

Example:

1. Plain sum (as you're doing now) - Sites 1 and 2 get a higher total score than Site 3 which has a rare sample of highly-valued species:

Site 1: Species 1 (score 15), Species 2 (score 20), Species 3 (score 25) = 60;
Site 2: Species 1 (score 15), Species 2 (score 20), Species 3 (score 25) = 60;
Site 3: Species 4 (score 50) = 50;

2. But with the magic of tf*idf, we get:

Site 1: Species 1 (score 15*0.176=2.64), Species 2 (score 20*0.176=3.52), Species 3 (score 25*0.176=4.4) = 10.56;
Site 2: Species 1 (score 15*0.176=2.64), Species 2 (score 20*0.176=3.52), Species 3 (score 25*0.176=4.4) = 10.56;
Site 3: Species 4 (score 50*0.477) = 23.85;

Inverse document frequency for Species 1 is log(3/2) ≈ 0.176, using base-10 logs (there are 3 sites, and the species appears in 2 of them); the same for Species 2 and 3. Inverse document frequency for Species 4 is log(3/1) ≈ 0.477 (there are 3 sites, and it appears only in 1 of them).
posted by Ender's Friend at 9:08 PM on December 3, 2012
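
The tf*idf recipe above can be sketched in a few lines of Python. The data is the three-site example from the post; the function name is made up for this example, and base-10 logs are assumed because they reproduce the 0.176 and 0.477 weights shown.

```python
from math import log10

# Species scores per site, taken from the worked example in the post.
sites = {
    "Site 1": {"Species 1": 15, "Species 2": 20, "Species 3": 25},
    "Site 2": {"Species 1": 15, "Species 2": 20, "Species 3": 25},
    "Site 3": {"Species 4": 50},
}


def idf_weighted_site_scores(sites):
    """Weight each species score by log10(number of sites / sites containing it)."""
    n_sites = len(sites)

    # Count how many sites each species appears in.
    sites_with = {}
    for scores in sites.values():
        for species in scores:
            sites_with[species] = sites_with.get(species, 0) + 1

    # Sum the weighted species scores within each site.
    return {
        site: sum(
            score * log10(n_sites / sites_with[species])
            for species, score in scores.items()
        )
        for site, scores in sites.items()
    }


result = idf_weighted_site_scores(sites)
for site, score in result.items():
    print(site, round(score, 2))
```

One property to be aware of: a species that appears in every site gets a weight of log(1) = 0 and contributes nothing, no matter how high its individual score is.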
