Parsing statistical data
May 12, 2014 3:06 PM   Subscribe

I have a spreadsheet with 120,000 or so rows & need to pull out some data.

Specifically, the format (columns separated by commas, with gaps between rows added for clarity) is --






First field represents an identifier for materials associated with A, B and/ or C (second field).

Desired output is the number of materials associated with:
* A, B & C alone
* A & B; A & C; and B & C
* all three together (A, B & C)

So, given the above data:
* A alone = count of 1
* B & C = count of 1
* A & C = count 1
* A, B & C = count of 2

I can kludge my way through this, but certainly not in any kind of elegant way that would make repeating the analysis down the road with new data particularly easy or fun.

Anyone with database or Excel chops have any pointers that could get me headed in the right direction with this little project?

Many thanks!
posted by woodman to Computers & Internet (12 answers total) 2 users marked this as a favorite
I thought the limit for Excel was in the neighborhood of 65k rows... that makes me curious what kind of spreadsheet this is...

Assuming it is a CSV of some form (or can be output as one) I suspect someone will chime in here with some kind of sed or perl command that does this automagically. You will still need to give them EXACTLY what the data looks like or it will just be garbage-in, garbage-out. Whatever method you use for the first run of this should be relatively easily repurposed for later uses as long as the format and the arguments are identical.

If it is Excel, I suppose I could take a look. Your description is not specific enough for me to know how easy/hard it might be to do this.
posted by milqman at 3:29 PM on May 12, 2014

I can't understanding your notation ("columns separated by commas, with gaps between rows added for clarity"). Can you clarify? Does it look like this?
     |  A  | B |  C  |
Row 1:  A
Row 2:  A    B    C
Row 3:  B    C
Row 4:  A    C
Row 5:  A    B    C

posted by misterbrandt at 3:31 PM on May 12, 2014

xlsx files are limited to a shade over a million rows- this is true for excel 2007/2008 and onwards.
posted by rockindata at 3:38 PM on May 12, 2014

It sounds like you want a simple 2-factor count if

=COUNTIF(A:A,"A", B:B,"1")
=COUNTIF(A:A,"A", B:B,"2")
=COUNTIF(A:A,"A", B:B,"3")
=COUNTIF(A:A,"B", B:B,"1")

and so on. Then add up groupings to get your A&B, B&C, A&B&C.
posted by 2bucksplus at 3:45 PM on May 12, 2014

2bucksplus is right, except (at least in Excel 2007) you have to use the COUNTIFS() function instead of COUNTIF() if you want to do multiple arguments.
posted by ropeladder at 3:50 PM on May 12, 2014 [1 favorite]

OK, If I'm understanding you correctly, you have 2 columns: "Num" and "Let", and there's a separate row for every instance of Num, even if it has 3 Lets, correct?

This format isn't going to be the best way to get you there. But we can make a new format that will be!

I'm assuming Column A is "Num" and Column B is "Let", and this is in a sheet called "sheet1"

Make a new sheet in your workbook. Starting in row 2, put your numbers all down column A (so cell A2 is "1", cell A3 is "2", etc). Put your letters across the top (cell B1 should be "A", C1 should be "B", D1 should be "C"

Now, go to cell A2. This is the cell for Let=A and Num=1. We want this cell to have a 1 in it if it's "true", and a 0 if it's false. There are a few ways to do this, but I'm going to do it with a countifs:

Now, fill that formula to the right to the end of your letters, and down to the end of your numbers.

Now, we have the data in a much more usable format! You can do a lot more manipulation like this.

But back to your original question. Let's say you now want to use this data to find out how many Nums are only A. You can get there with the following countifs:

We're looking at the 3 columns in order, and assigning them the values of 1 (true), 0 (false), and 0 (false again). This will tell you how many rows met that criteria. If you want only B and C, it would look like:

Give that a try and let me know if you run into trouble! I'm happy to send you a sample doc.
posted by brainmouse at 3:52 PM on May 12, 2014

Another tip:

Because this is obviously fake data, and you have a lot of rows, there are a few ways to simplify this so that you can do a little less work.

The first is pivot tables. Instead of the first step where I made the new sheet, you can get Excel to do this automatically with a pivot table. Here's how (I'm assuming you're on a recent version of Excel for Windows here, some of these steps are a little different if you're elsewhere)

Select all your data, and go to insert --> pivot table.
Make your "row labels" Num, and your "column labels" Let
Put "Let" as your values. This should show up as "count of Let" (if it says "sum" instead of "count" you're going to run into some problems. This is fixable, but it's easier if you just drop a non-numeric value in there to begin with, then it defaults to "count")

This will automagically create that original table for you. It will give you some totals also. You should be able to do Part 2 of my advice above on the pivot table now (but be careful about the Grand Total row -- if they're all bigger than 1 it won't be an issue).
posted by brainmouse at 3:59 PM on May 12, 2014

Oops one mistake in that last comment -- because the non-matches will be Blank instead of 0, you'll need your final countifs to look slightly different:

To count all instances of A and A alone, it should look like:

With quotes surrounding nothing replacing the 0s.
posted by brainmouse at 4:05 PM on May 12, 2014

Excel is a million or so, 65k is from ancient versions.

Also I could do this in about 40 seconds in Stata or R with a few lines of code. Consider investing time in those languages if this will be a recurring challenge of your work. If you do send me a message and I'd be happy to code this particular issue for you as a learning guide.

Keep if vector=A | vector=B
posted by jjmoney at 4:31 PM on May 12, 2014 [1 favorite]

OK, I took your nomenclature to mean that there might or might not be something entered into each of the A, B and C column cells that would be different for each row, and you want to know how complete your data set is. A simple way to deal with this is to use the next three columns to store 3 if/isblank statements directed at the cells in A, B and C. Assign each one a different number such that A, B, C, A+B, A+C, B+C and A+B+C all equal different sums (hint: 1, 3 and 5 work). So going across, your 3 if/isblank statements will be:=IF(ISBLANK(a1),0,1) =IF(ISBLANK(B1),0,3) and =IF(ISBLANK(C1),0,5)

In the next column, add the results. Finally, set up a table that uses =COUNTIF pegged to each indicator sum against the column where you stored that. For example, for A+B it would count the occurrences of the number 4 (1 for A plus 3 for B) in column G, like so: =COUNTIF(G:G,4).
posted by carmicha at 5:59 PM on May 12, 2014 [1 favorite]

A little program for Association Rule Induction.
posted by the big lizard at 5:16 AM on May 13, 2014 [1 favorite]

Apologies for the long delay in getting back to this thread, and many thanks to the respondents.

I tried playing around with the COUNTIFS function, but couldn't quite get it sussed. I'm woefully weak with PivotTables as well. So I'll keep working on this as it'll be a good way to build my skills.

In the end, I ran out of time & had to resort to a pretty kludge-y method -- adding multiple columns, assigning values (1,3,5) -- thanks carmicha! -- to the three possible values for the second column, then summing these values on changes in the first column. Yuck.

As I say, will keep playing with more efficient & elegant solutions. I may also be able to generate the original data that's more amenable to this kind of analysis.

Thanks again to all!
posted by woodman at 5:23 AM on May 28, 2014

« Older Borderline Personality Disorder or Narcissism in...   |   Negotiating the right to moonlight Newer »
This thread is closed to new comments.