Quick Data Collection Problem
November 10, 2015 1:52 PM   Subscribe

I have a CSV with one column that contains User IDs. I need to make another CSV that counts how many times each User ID appears in the document, and output it to another CSV That has something like UID1 >> 3; UID4 >> 1; UID20 >> 0. There are thousands of User IDs, so I need to be able to set up a range for it to scan for. Is this possible with free software and not much programming expertise?

Here's a sample of the data:

The range of total data is 1 to 5. The data in the spreadsheet is

1
1
1
3
3
5

I need it to output that there are three 1's, zero 2's, two 3's, zero 4's, and one five.

Even though 4 and 2 don't appear in the document, I still need it to go down the list and fill in '0' for those while counting the other, extant values. This will be applied to lists that number in the few thousands.
posted by codacorolla to Computers & Internet (8 answers total) 1 user marked this as a favorite
 
Response by poster: Actually, I just realized that what I'm asking is a little off.

Instead of counting EVERYTHING in a range, I just want the script to count based off of another list. So going back to the previous example,

Counting Master List: 1, 3, 4

Data Sheet:
1
1
2
3
3
5

That would count that there are two 1's, two 3's and zero 4's while ignoring the 2 and the 5, since they don't exist on the list.
posted by codacorolla at 1:56 PM on November 10, 2015


A pivot table in Excel or even Google Sheets should do this. Open Office Calc does pivot tables too.
posted by GuyZero at 2:01 PM on November 10, 2015 [1 favorite]


Although a pivot table may not do the filtering you want, but it can definitely do counting/summing.
posted by GuyZero at 2:01 PM on November 10, 2015


This is very easy with any spreadsheet program (I know google docs will do it, and while I don't know the features exactly of some of the downloadable Excel replacements, I assume they all have this feature): just create a pivot table. You want the userid as the row labels, and the count of userid as the values. Change the range of your pivot table so it's the entirety of the columns you care about (probably something like A:A)

To do part 2, there are a few ways. One is, have another list of numbers, so your master list would look like:
1
3
4

Then do a vlookup on your data and have a condition for errors, so your main data sheet would have in the second column (this is excel formatting, but other programs are similar):
=iferror(vlookup(A1,[range of your master list],1,false),"error")

That should give you the same value back if it's in your master list, and "error" otherwise. Then you can add that as a filter to your pivot table and exclude "error"
posted by brainmouse at 2:03 PM on November 10, 2015


Best answer: Do you have Excel? If so, this can be done pretty easily! Here is how I would do it -- modify these instructions as you see fit!

I'm changing this from numbers to animals to make it less confusing, so if you have a list from A1 to A7 that's like:

Pig
Pig
Cow
Pig
Horse
Cow
Sheep

I would copy/paste them all into a new column and hit the "remove duplicates" button for that list so you have them somewhere like C1 to C4 and it looks like like:
Pig
Cow
Horse
Sheep

In the column next to it, do the simple formula =COUNTIF(range, criteria) so your range will be the first list and the criteria will be the second list (e.g. =COUNTIF(A$1:A$7, C1)) and then you can just drag it down (=COUNTIF(A$1:A$7, C2), &c.) and then you have a nice list so it looks like:

Pig 3
Cow 2
Horse 1
Sheep 1

If you don't have Excel, I imagine there's an OpenOffice version that will do basically the same thing. Good luck!
posted by Mrs. Pterodactyl at 2:03 PM on November 10, 2015 [3 favorites]


In a unix:

anglachel:~/tmp> cat data
Pig
Pig
Cow
Pig
Horse
Cow
Sheep
anglachel:~/tmp> cat data | sort | uniq -c | sort -n
1 Horse
1 Sheep
2 Cow
3 Pig
anglachel:~/tmp>


If you've got comma-delimited data there, you can use the cut utility to slice out the field you want.
posted by mhoye at 2:11 PM on November 10, 2015 [6 favorites]


If you're only going to do this once, download a free trial of SPSS. From the menu, choose data-> aggregate. Set the ID variable as your break variable and then check off the button that says something like "number of cases" then give the line next to that button the variable name for the number of cases.

Assuming you want a completely new data file with this information in it, choose create new data file. If you want it merged into your existing data, (that would look like this:
ID NumTimes
1 3
1 3
1 3
3 2
3 2
5 1

) then that's the default.

You can also at the same time calculate other aggregate measures if that's of interest. Like let's say your data contains a variable that is each person's hours worked that week (so each case is a person-week), then you can calculate at the same time the average hours worked per week for that ID. Or if you think people did more than one survey and might have listed different satisfaction levels each time you can get the mean or the min or the max for each person or whatever you want.

PSPP is an open-source version of SPSS and probably has similar functions.
posted by If only I had a penguin... at 2:17 PM on November 10, 2015


Response by poster: Mrs. Pterodactyl's method worked like a charm. Thanks!
posted by codacorolla at 4:58 PM on November 10, 2015 [1 favorite]


« Older We're moving on up.   |   Best Slippers/Moccasins? Newer »
This thread is closed to new comments.