Can you help me with an Excel question?
April 22, 2010 10:12 AM   Subscribe

Can you help me with an Excel question?

I have an Excel spreadsheet.  Each row represents a customer.  Columns include customer ID, then each product that customer uses.  Sometimes a customer uses 2 of a product.  Each product column contains either the number of that product the customer uses, or a zero.  I need the easiest way to figure out something like this:  For customers who only have one product, the commonest product is A (ok, that one’s easy).  For customers who have 2 *different* products, the commonest products are A and B.  There are 70 products.  I think that to be useful, I would want to blow this out through about 20 of these products.  I would appreciate any help!
posted by ersatzkat to Computers & Internet (13 answers total) 7 users marked this as a favorite
You want the LARGE() function. It is like MAX() but allows you to specify the 2nd largest or 3rd largest (or nth largest) number in an array. Make sure to use CTRL+SHIFT+ENTER when you input the formula.

If your Customer ID is in col A, and each of your 70 products are in rows B through XX, and your first Customer is in row 2, you want this function:

=LARGE(B2:XX2,1) will return the largest number
=LARGE(B2:XX2,2) will return the second-largest number, etc
posted by jckll at 10:25 AM on April 22, 2010 [2 favorites]

and each of your 70 products are in rows columns B through XX
posted by jckll at 10:26 AM on April 22, 2010

Here I was thinking of some really complex way to solve this using COUNTIF and COUNTBLANK and whatnot. jckll, that is cool. I learnt something new today.
posted by thewildgreen at 11:10 AM on April 22, 2010

Apologies if I'm being thick-headed or not plain (and I do appreciate you taking the time to look at this). When I applied this function, it gives the me largest number (or second- or third-largest) in each customer row. What I'm trying to get to is information across all customers. Say I have a list of customers, and they all have some combination of fruits - apples, bananas, oranges, and pears. I need to know what's the commonest fruit amongst all customers. We'll say apples. That's easy for me to get to because all I have to do is total each fruit column. But for every customer that has apples AND another fruit, what's the second commonest fruit? Then third, then fourth.

In my head, it sounds all professorial and sense-making - "For customers who have only one fruit, the most common fruit is apples. For customers who have two fruits, the most common fruits are apples and pears. For customers who have three fruits, etc".

Either I don't know how to apply the function that you've so kindly shared, or I don't know how to articulate what I'm really trying to get to. It's been a trying day, I apologize if I'm sending folks 'round the way for something that's really simple.
posted by ersatzkat at 4:54 PM on April 22, 2010

How are you defining "commonest"? For example, say you have the following situation:
Customer Apples Oranges Bananas
A          1000       0       0
B             0       1       0
C             0       2       0
In this case, all customers have exactly one fruit. Is the most common fruit Apple (because it occurs 1000 times) or is the most common fruit Orange (because it occurs between two customers)?

Also, in the following situation:
Customer Apples Oranges Bananas
D           100       0       1
E             0       1     100
Would we say that Apples and Bananas are the most common fruits among customers who have two fruits despite the fact no customer actually has both Apples and Bananas?
posted by mhum at 6:10 PM on April 22, 2010

mhum - with your chart (and thank you, the visual helps me), here's what I would glean that fits what I'm trying to say. For customers (A through E) who have one fruit, the most common fruit is oranges. Three of my customers have at least one orange. For customers who have an orange and another fruit, the most common second fruit is bananas - one orange customer has at least one banana, and two have no second fruit.

Each step through the analysis would take the previous "most common fruit" into the calculation, because what's ultimately important to me is understanding the common fruit groups that people tend to have.

The number of oranges per customer is not important - the number of customers with oranges is. I now see I might have confused that point when I mentioned above that "sometimes a customer uses 2 of a product". What I should have said is that if a customer doesn't use a product, there's a zero, otherwise there's a "non-zero", which I think must be useful information in some way to get me where I need to go.
posted by ersatzkat at 7:43 PM on April 22, 2010

Okay, this is a little more clear now, although I'm still not exactly sure what your ultimate goal is.

How would your analysis proceed in the following case:
Customer Apples Oranges Bananas
A             1       1       0
B             1       0       0
C             1       0       0
D             1       0       0
E             1       0       0
F             0       1       1
G             0       1       1
H             0       1       1
If I understand correctly, you would first start by observing that Apple is the most common fruit. Next, of all the customers with at least one apple, the most common second fruit is Orange. At this point we would have to stop since there are no customers with three fruits. However, Apple and Orange is not the most common combination of two fruits; the most common pair of fruits is actually Orange and Banana.

If your goal is really to find the most common pair, triple, quadruple, ..., n-tuple of fruits, this is actually much easier (in my mind) than my understanding of your "previous most common fruit" analysis.
posted by mhum at 9:57 PM on April 22, 2010

That is a great example chart by mhum. ersatzkat -- your latest explanation makes the problem completely different and more complex. I cannot think of a way to solve this without using VBA code. It would be not too difficult to do this with VBA (my initial try would be a few nested loops, with the inner-most one parsing the data row-by-row), but I'd be pleasantly surprised if anyone could show how to solve this with just excel formulae.
posted by thewildgreen at 10:01 PM on April 22, 2010

mhum - your second example is spot on (and thank you again). If your data set was the one I was working with, I would stop at apple and orange. I wish I could just glance at my own data and figure it out, instead of 70 products and upwards of 18k customer records.
posted by ersatzkat at 3:58 AM on April 23, 2010

Okay, I think I have an idea of how I'd do this although it'll be rather manual and tedious.

But first, I just want to confirm that you're looking for what I think you're looking for. Consider the following expansion of my last example: Customer 1 has Apple and Orange. Customers 2 through 51 have just Apple. Customers 52 through 100 have Banana, Grape, and Strawberry. In your analysis, you'd first start by observing that Apple is the most frequent fruit (Customers 1 through 51). Then, of the customers with Apple, Orange is the most frequent second fruit (in fact, the only other second fruit). Now, do we stop here because there are no customers with Apple, Orange, and some third fruit? Are you more interested in knowing that exactly one customer has Apple and Orange than the fact that just under half of your customers have Banana, Grape, and Strawberry?
posted by mhum at 9:36 AM on April 23, 2010

mhum - that's exactly it. I could explain more about why that makes sense to me, perhaps over MeMail. It's work-related and I've probably dropped enough info about my job in some of my answers that it'd be a not-great idea (for me) if I were to air project-type detail here. Thanks so much for being so patient.
posted by ersatzkat at 10:25 AM on April 23, 2010

Okay. Here we go. This will be extremely manual so you might want to familiarize yourself with keyboard shortcuts if you haven't already (I'll point them out as I go along).
1) Make a copy of your sheet. We'll be modifying your table in this process (e.g.: deleting rows) so we don't want to touch the original data. Let's say that Customers are in Column A and Fruits are in Columns B through BS. I'll assume that Row 1 has the column headers (e.g.: "Customer", "Apple", "Orange", etc...)

2) Insert a row at the the very top (so that your table now starts in row 2 instead of row 1).

3) In cell A1, put in a huge number, bigger than the number of rows in your table (say, 99999999). This huge number will be used to keep the Customer column at the left-hand side when we start sorting the table.

4) In cell B1 put this formula: =COUNTIF(B3:B18000,">0") where B18000 is the bottom of the column. Copy this formula across through BS1. Now, cells B1 through BS1 should contain the count of customers who've ordered each fruit.

5) Select the entire range (i.e.: A1:BS18000). You should be able to do this by hitting Ctrl-A. Hit Alt-D and then S to bring up the sort menu. Sort left to right (in Excel 2007, this is under the Options button) and decreasing based on row 1. Now, Column B should be the commonest fruit.

6) Select just your table (i.e.: A2:BS18000, excluding row 1). If you don't know how to use Ctrl-Shift-Arrow Keys to select large ranges, now's a good time to learn. Sort this range from top to bottom, decreasing, based on Column B.

7) Scan down Column B (visually) until you see zeros (they should all be at the bottom). You can use conditional formatting to help highlight where the zeros start, but it's not necessary. Select all the rows with a zero in Column B and delete them. Ctrl-Shift-Arrows will help here too. When you delete these rows, the numbers in row 1 will change.

8) Do Step 5 again (i.e.: select the entire range, sort left to right decreasing on the first row). Now, Column C should contain the most common fruit among all people who had the fruit in Column B (Column B should remain the same).

9) Do Step 6, except this time you should sort based on Column C.

10) Do Step 7, except this time you're scanning down Column C for zeros.

11) Do Step 5 again.

12) Do Step 6, except this time you should sort based on Column D.

13) Do Step 7, except this time you should scan down Column D.

14) You should get the idea here. Keep looping through steps 5, 6, and 7, each time moving over one column.

15) When you're done (after 20? iterations), row 2 should have a list of fruits in the order that you want. Column B should be the commonest fruit, Column C should be the commonest fruit associated with the Column B fruit, Column D should be the commonest fruit associated with both fruits in Columns A and B, etc...
If this is a one-shot (or, at least, once in a blue moon) thing, I could see this as tolerable. If this is an all-the-time thing, I'd start thinking about developing a programming solution, either VBA inside Excel or exporting your data out of Excel and using some script-y language like Perl, Python, or Ruby.
posted by mhum at 11:20 AM on April 23, 2010

mhum - thanks so much for this. Parental health drama has kept me away this weekend, but I'm taking this to work this morning to give it a go. I'll post the results - thank you!
posted by ersatzkat at 3:40 AM on April 26, 2010

« Older This is why I don't gamble..   |   How to maintain sub navigation on a Wordpress page... Newer »
This thread is closed to new comments.