Find standard deviation in Excel, but first develop a labor-saving trick
June 21, 2013 12:54 PM   Subscribe

Help me with statistics and Excel. Especially help me if you know any labor saving methods. I want the median, mean and standard deviation for the average price of all items sold, but my spreadsheet-full-of-data doesn't tell me the price of each sale -- just the average price per store, and the number sold at that store. Something like this:

Store	# sold	Avg price per item
A	72		$16,824
B	208		$13,133
C	269		$15,190
...
So I think I need to count store A's price 72 times and store B's price 208 times, etc. It's pretty each to figure out algebra that will allow me to calculate mean treatment price given this information ("# of items sold at A" times "Avg. price at A"+"# of items sold at B" times "Avg. price at B" ... etc.) divided by (sum of # of items sold at all stores).

Is there a formula or other approach I can use in Excel that will also allow me to calculate the median and standard deviation relatively easily? Or do I need to re-create my spreadsheet so it reads
Store	Avg price per item
A	$16,824
A	$16,824
A	$16,824
... (etc, a total of 72 times)
B	$13,133
B	$13,133
... (etc., etc.)
There are 30-some stores, 99 distinct items and roughly 39,500 unique sales of those items across all the stores, so I'd *really* prefer not to have to copy and paste onto a new line to represent each individual item sold.
posted by croutonsupafreak to Work & Money (8 answers total) 1 user marked this as a favorite
 
The Wikipedia page on weighted arithmetic means offers some examples of and formulas for how to calculate these properties.
posted by Blazecock Pileon at 1:14 PM on June 21, 2013


Response by poster: Thanks, Blazecock Pileon. That's helpful, although some of that math is over my head. I'd love any help possible with translating to Excel-ese. In the mean time, I guess I'll dig up my old stats text book (knew there was a reason I kept it around) to see if it will help me interpret the Wikipedia entry.
posted by croutonsupafreak at 1:24 PM on June 21, 2013


Depending on what you want to know you may or may not be able to calculate it from the above data. More precisely, the median value (or standard deviation) of all products sold is not the same as the median average sale price from each store. This example should make that clear:

Store A sells two of the item, one at $1, and one at $7. The average price for store A is (1+7)/2 = $4

Store B sells one of the item, at $2, so its average price is $2.

Store C sells two of the item, one at $2, and one at $4. It's average price is $3.

The median price for all five items = median(1,2,2,4,7) = 2.

The median of the average prices is median(2,3,4) = 3, which is different. Even if you count each average the number of times that store sold something you would have median(2,3,3,4,4) =3, which is still different than the median price of all items sold.

So you really want to think about what you want to measure. You can calculate weighted medians and weighted standard deviations along the lines of the weighted mean, but that won't give you the same answers as if you had all the original sale prices.

Hopefully this isn't too pedantic. If you already know all this I apologize!
posted by pombe at 1:40 PM on June 21, 2013 [2 favorites]


Your total average for all stores is correct - you're taking the "weighted average."

I'm tempted to ask why you care about the median, unless this is a school exercise. For a normal distribution mean = median = mode.

For stdev, you could take the stdev of all the means which in excel is stdev(A,B,C...) but all that tells you is that if you were to draw a number out of the "hat of average store sales", you can predict where that value might fall.

I feel like I need more info about what you want in order to make the statistics give you a meaningful answer. I find the trick with statistics is to always ask yourself: yes but what does this math represent in real life?

(omg I am sooo jonesing to look at your spreadsheet now, does that make me sick?)
posted by St. Peepsburg at 1:45 PM on June 21, 2013


Let's say your #sold data is in cells B2 - B40, and the average price per item is in cells C2 - C40. (row 1 as your column headers). The total number of sales is going to be useful for all three calculations, so let's say you put that in cell B42:
=SUM(B2:B40)
The mean is:
=SUMPRODUCT(B2:B40,C2:C40)/B42
Let's say you put that in cell C42, because we'll use that number again.

This works regardless of the distribution of individual sale prices at each store.

I mention that because it's time for a huge caveat regarding the standard deviation and the mean: you don't have enough information to calculate the true values of these. To calculate those, you'd need to know more about the distribution of sales at each store. Those 72 sales at store A might be 72 sales at $16,824 each, or they might be 54 sales at $18,395 and 18 sales at $12,111 each, or any number of other distributions. And what that distribution is makes a difference for the median and standard deviation. (On preview, repeating pombe's warning.)

That said, if you want to assume that the 72 sales at store A were in fact 72 sales all at $16,824 each, and so forth, you can calculate the standard deviation and median given that assumption.

For the standard deviation, put the following in cell D2:
=B2*(C2-C$42)^2
Now pull that down to fill cells D3-D40. And in cell D42, put:
=SQRT(SUM(D2:D40)/B42)
That's your standard deviation.

For the median, you'll need to sort your data on average price per item (ascending, or descending, it doesn't matter). Then put a running total of the sales in column E, by putting the following in cell E2:
=SUM(E$2:E2)
and pull that down to fill cells E3-E40.

Now take half the number of your total sales (half the number in B42), and find the smallest number in column E which is greater than that. The corresponding price per item is the median. Special case: if a number in column E is exactly half the total number of sales, the median is the average of the price per item corresponding to that entry and the next price per item in the list.
posted by DevilsAdvocate at 1:48 PM on June 21, 2013 [1 favorite]


Sorry, that last formula should be
=SUM(B$2:B2)
Just missed the edit window.
posted by DevilsAdvocate at 1:54 PM on June 21, 2013


Response by poster: St. Peepsburg: You're right, I don't actually need the median.

My goal here is to identify striking patterns related to both stores and items for sale, along the lines of:
* Which stores are charging considerably more or less than the norm for multiple items? (At which point I can start asking if they are taking advantage of people and/or undercutting the competition as a result of their charges.)
* Which items tend to sell for about the same price consistently, and which have a very wide variation? (Which would lead me to ask why the variation exists.)

I can actually find so-so answers to both of these questions by just sorting and eyeballing what I see, but I'd prefer to be more methodical and rigorous than that with my work.


(I'm not actually researching stores and items sold, but the analogy is close and requires less explanation than going into full details, so I'll stick with it.)

posted by croutonsupafreak at 2:06 PM on June 21, 2013


Response by poster: kalessin and DevilsAdvocate, thanks for the potentially useful/helpful answers. It's gonna take me time to try to fully grok your posts, so follow-ups and/or best answer status may be forthcoming. We'll see.
posted by croutonsupafreak at 2:07 PM on June 21, 2013


« Older Is "polywood" (recycled plastic) patio furniture...   |   Why are kissing and hungry presented as opposites? Newer »
This thread is closed to new comments.