You R to do this again, and again, and again...
October 10, 2011 8:07 AM

[StatsFilter] I have a data set which consists of 1000 groups. I'd like to perform the same commands on each group using the statistical software R. How can I write a loop function to perform this task? Or would any of the functions from the *apply family do a better job?

(Asking for a friend who has been stuck on this for a while...)

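I have a set of agents. Each agent performs a sequence of events. Each row records an event's ID and the ID of the event that came before it (Previous_ID); the first event for an agent is the one with Previous_ID = 0. Within each agent, the rows of the data frame are sorted by ascending Previous_ID, not by the order in which the events happened. The original order can be restored by following the chain: start at the row with Previous_ID = 0, then find the row whose Previous_ID equals that row's ID, and so on.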

Sample data:

ID Previous_ID Agent
845 0 5360
926 153 5360
993 234 5360
234 845 5360
848 926 5360
153 993 5360
234 0 8765
968 234 8765
545 968 8765
913 0 2329
372 119 2329
719 189 2329
119 324 2329
761 355 2329
890 372 2329
266 719 2329
324 761 2329
189 890 2329
355 913 2329

For example, for Agent 5360, the original order of events is: 845 - 234 - 993 - 153 - 926 - 848.

The number of events differs from agent to agent.

I want to restore the order of events for all the agents, using the statistical software R.

I have written some code to restore the order of events for one particular agent, e.g., Agent 5360.

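# Keep only the rows for one agent
agent.5360 <- events.data[events.data$Agent == 5360, ]
order <- c()
# The first event is the one with no predecessor (Previous_ID == 0)
order[1] <- agent.5360[agent.5360$Previous_ID == 0, "ID"]
# Each later event is the row whose Previous_ID is the event just found
for (i in 2:nrow(agent.5360)) {
  order[i] <- agent.5360[agent.5360$Previous_ID == order[i - 1], "ID"]
}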

and the above commands give the original order of the events for Agent 5360 as described earlier.

How can I apply this to all the agents? Would any of the functions from the apply family help? Any pointers would be greatly appreciated!
posted by fix to Computers & Internet (12 answers total)
 
Best answer: Bleurgh formatting flagged for removal.

Here's a pastebin.
posted by cromagnon at 8:53 AM on October 10, 2011


Best answer: I have no idea whether (and kind of doubt that) this is faster than doing it the natural loopy way.

The key is probably the by() function at the end.
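
A minimal sketch of how by() could be used for this, assuming the column names from the sample data in the question (ID, Previous_ID, Agent). The helper name restore_order is made up for illustration, and only two of the sample agents are included to keep it short.

```r
# Sample rows copied from the question (agents 5360 and 8765 only).
events.data <- data.frame(
  ID          = c(845, 926, 993, 234, 848, 153, 234, 968, 545),
  Previous_ID = c(0, 153, 234, 845, 926, 993, 0, 234, 968),
  Agent       = c(5360, 5360, 5360, 5360, 5360, 5360, 8765, 8765, 8765)
)

# Hypothetical helper: follow the Previous_ID chain within one agent's rows.
restore_order <- function(localdata) {
  ordering <- numeric(nrow(localdata))
  # Start at the event that has no predecessor.
  ordering[1] <- localdata$ID[localdata$Previous_ID == 0]
  # Each later event is the row whose Previous_ID is the event just found.
  for (i in seq_len(nrow(localdata))[-1]) {
    ordering[i] <- localdata$ID[localdata$Previous_ID == ordering[i - 1]]
  }
  ordering
}

# by() splits the data frame on Agent and calls restore_order on each piece.
result <- by(events.data, events.data$Agent, restore_order)
result[["5360"]]  # 845 234 993 153 926 848
```

The result is indexed by agent, so the vectors can have different lengths without any padding.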
posted by a robot made out of meat at 10:39 AM on October 10, 2011


Best answer: If by() isn't exactly what you need, the plyr package has tons of related tools.
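
For comparison, here is a rough sketch of the same split-apply pattern using only base R (split() plus lapply()), which is the pattern that plyr functions such as dlply() wrap. plyr itself is not loaded here, so this is a stand-in and not a plyr example. The helper name restore_order and the column names are assumptions based on the sample data in the question.

```r
# Sample rows copied from the question (agents 5360 and 8765 only).
events.data <- data.frame(
  ID          = c(845, 926, 993, 234, 848, 153, 234, 968, 545),
  Previous_ID = c(0, 153, 234, 845, 926, 993, 0, 234, 968),
  Agent       = c(5360, 5360, 5360, 5360, 5360, 5360, 8765, 8765, 8765)
)

# Hypothetical helper: follow the Previous_ID chain within one agent's rows.
restore_order <- function(localdata) {
  ordering <- numeric(nrow(localdata))
  ordering[1] <- localdata$ID[localdata$Previous_ID == 0]
  for (i in seq_len(nrow(localdata))[-1]) {
    ordering[i] <- localdata$ID[localdata$Previous_ID == ordering[i - 1]]
  }
  ordering
}

# split() gives a named list with one data frame per agent;
# lapply() then applies the helper to each and returns a plain named list.
pieces <- split(events.data, events.data$Agent)
orders <- lapply(pieces, restore_order)
orders[["8765"]]  # 234 968 545
```

A plain list is often easier to post-process than the object by() returns, for example with lengths(orders) to count events per agent.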
posted by a robot made out of meat at 10:45 AM on October 10, 2011


Best answer: Oh, 3 notes about the inner loop
1) order() is a useful function; you should be careful about writing over it
2) you should check that you got all the events with something like all.equal( sort( unique(output ) ) , sort( unique( c(prev_id , this_id ) ) ) )
3) you should put a try() or some such around the inner loop. That way R will keep going and do the rest even if a handful of agents don't work due to some kind of data error.
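
Notes 2 and 3 could be sketched roughly like this. The names restore_order and safe_restore are made up, the column names follow the sample data in the question, and agent 9999 is invented here purely to have a broken case (it has no event with Previous_ID = 0). The completeness check is a variant of the one in note 2: it compares against the ID column only, because Previous_ID also contains the 0 marker.

```r
events.data <- data.frame(
  ID          = c(234, 968, 545, 111, 222),
  Previous_ID = c(0, 234, 968, 222, 111),
  Agent       = c(8765, 8765, 8765, 9999, 9999)  # 9999 is a made-up bad agent
)

# Hypothetical helper: follow the Previous_ID chain within one agent's rows.
restore_order <- function(localdata) {
  ordering <- numeric(nrow(localdata))
  ordering[1] <- localdata$ID[localdata$Previous_ID == 0]
  for (i in seq_len(nrow(localdata))[-1]) {
    ordering[i] <- localdata$ID[localdata$Previous_ID == ordering[i - 1]]
  }
  ordering
}

# Wrap the helper so that one bad agent does not stop the whole run.
safe_restore <- function(localdata) {
  result <- try(restore_order(localdata), silent = TRUE)
  if (inherits(result, "try-error")) {
    return(NA)  # mark the agent as failed and carry on
  }
  # Check that every event ID for this agent ended up in the output.
  if (!isTRUE(all.equal(sort(unique(result)), sort(unique(localdata$ID))))) {
    warning("some events are missing for agent ", localdata$Agent[1])
  }
  result
}

orders <- lapply(split(events.data, events.data$Agent), safe_restore)
orders[["8765"]]  # 234 968 545
orders[["9999"]]  # NA
```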
posted by a robot made out of meat at 11:17 AM on October 10, 2011


Response by poster: (A follow-up from my friend.)

Here's what I did.

but it only returned the results for the first agent. Should I have written something else for the return() part ?
posted by fix at 11:45 AM on October 10, 2011


Best answer: Yes (I think) - basically the return() releases the result back into the world at the first iteration and therefore stops the loop.

You would be better off declaring a data structure before the j loop starts that can hold one order.j object per agent. This would probably have to be a list, given that each order.j has a different length. Each j iteration adds one order.j to the list. Then, at the end of the function, return the whole data structure to the user.
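
Something along these lines, as a rough sketch. The function name restore_all and the variable all.orders are made up, and the column names are taken from the sample data in the question, so adjust them to the real data.

```r
events.data <- data.frame(
  ID          = c(845, 926, 993, 234, 848, 153, 234, 968, 545),
  Previous_ID = c(0, 153, 234, 845, 926, 993, 0, 234, 968),
  Agent       = c(5360, 5360, 5360, 5360, 5360, 5360, 8765, 8765, 8765)
)

restore_all <- function(events) {
  agents <- unique(events$Agent)
  # Declare the container before the j loop: one list slot per agent.
  all.orders <- vector("list", length(agents))
  names(all.orders) <- agents
  for (j in seq_along(agents)) {
    agent.j <- events[events$Agent == agents[j], ]
    order.j <- numeric(nrow(agent.j))
    order.j[1] <- agent.j$ID[agent.j$Previous_ID == 0]
    for (i in seq_len(nrow(agent.j))[-1]) {
      order.j[i] <- agent.j$ID[agent.j$Previous_ID == order.j[i - 1]]
    }
    # Store this agent's result; do not return() here.
    all.orders[[j]] <- order.j
  }
  # Hand back the whole list only after the loop has finished.
  all.orders
}

res <- restore_all(events.data)
res[["5360"]]  # 845 234 993 153 926 848
res[["8765"]]  # 234 968 545
```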
posted by cromagnon at 12:05 PM on October 10, 2011


Best answer: return() halts the whole function. I'm not sure that the first line does what you want. Functionalize the part that works on the data for a single agent and use by().
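
A toy illustration of the first point, not tied to the events data; both function names are made up. The first version calls return() inside the loop and so only ever processes the first element, while the second fills a result vector and returns it after the loop.

```r
# return() inside the loop exits the function on the first iteration.
first_only <- function(xs) {
  for (x in xs) {
    return(x * 2)
  }
}

# Collect the results in the loop, then return them once at the end.
all_values <- function(xs) {
  out <- numeric(length(xs))
  for (k in seq_along(xs)) {
    out[k] <- xs[k] * 2
  }
  out
}

first_only(c(1, 2, 3))  # 2
all_values(c(1, 2, 3))  # 2 4 6
```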
posted by a robot made out of meat at 12:07 PM on October 10, 2011


Response by poster: "Thank you so much cromagnon and robot!

This time I took the code that worked for one particular agent, functionalized it, and used the by() function at the end.

My code

Outputs:
: 2329
[1] NA NA NA NA NA NA NA NA NA NA
------------------------------------------------------------
: 5360
[1] NA NA NA NA NA NA
------------------------------------------------------------
: 8765
[1] NA NA NA

Warning messages:
1: In ordering[i] <- ... : number of items to replace is not a multiple of replacement length
2: In ordering[i] <- ... : number of items to replace is not a multiple of replacement length
3: In ordering[i] <- ... : number of items to replace is not a multiple of replacement length

Why did it work for one particular agent, but not with the by() function? Should I have specified the length of ordering?"
posted by fix at 1:38 PM on October 10, 2011


Best answer: I used different column names than you did. It looks like you used my names in the function.
posted by a robot made out of meat at 2:19 PM on October 10, 2011


Response by poster: "Hi robot,

Yes, I followed your advice "Functionalize the part that works on the data for a single agent and use by()." So I used your names to test for the outputs. "
posted by fix at 2:32 PM on October 10, 2011


Best answer: When R goes to look for localdata$prev_id it doesn't exist, because localdata has no column named prev_id. You want localdata$Previous_ID, since that's the name in your dataframe (if that's the name in your data).
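
A small sketch of why this fails quietly, and one way to catch it early. On a data frame, `$` with a column name that doesn't exist gives NULL without raising an error, so the mistake only shows up later. The helper check_columns is made up for illustration, and the column names are the ones from the sample data in the question.

```r
localdata <- data.frame(ID = c(234, 968, 545), Previous_ID = c(0, 234, 968))

is.null(localdata$prev_id)      # TRUE: no such column, and no error raised
is.null(localdata$Previous_ID)  # FALSE: this column exists

# Hypothetical guard: stop with a clear message if a needed column is absent.
check_columns <- function(df, needed) {
  absent <- setdiff(needed, names(df))
  if (length(absent) > 0) {
    stop("missing column(s): ", paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

check_columns(localdata, c("ID", "Previous_ID"))  # passes silently
```

Calling check_columns(localdata, c("this_id", "prev_id")) at the top of the per-agent function would stop with a message naming the missing columns.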
posted by a robot made out of meat at 4:41 PM on October 10, 2011


Best answer: Sorry if that was confusing in my code example.
posted by a robot made out of meat at 4:49 PM on October 10, 2011


This thread is closed to new comments.