SAS/proc phreg code
April 7, 2015 9:55 AM
Anyone familiar with proc phreg (Cox regression/survival analysis) in SAS who could help me figure out if my code is right? I'm new to survival analysis and my data are set up a little differently than the examples I'm seeing online so I'm not sure I'm doing it right.
Background:
I'm studying people seeking help. Participants described contacts with between 1 and 3 "responders" (e.g., friends, the police) in order; for example, a participant could have contacted just responder 1, or responder 1, then responder 2, then responder 3. I'm trying to predict help-seeking dropout: for example, a participant who contacted responder 1 but did not go on to contact a second or third responder would have a dropout at responder 1. So unlike other survival analyses, the observations are responders rather than time points, but they're still ordered in time. The independent variables in my model include characteristics of the people seeking help (e.g., gender) and aspects of their interactions with the responders (e.g., whether they liked the interaction). The data are right-censored for those participants who said that they contacted more than three responders, because they could not record more than three responders in the survey. There are two people who only reported on responder 3; those people are left-censored because data are missing for responders 1 and 2.
Data setup:
The data are set up as a person-period dataset such that there is a line for each responder, which means that some participants have multiple lines. Responders are nested within participants. So a participant that contacted two responders would have two lines in the dataset; the participant-level data is the same in both lines and the responder-level data is different.
Here's what the data looks like.
Variables:
id is the ID number for the participant.
responder represents the responder number in the order that the participant contacted them. Possible values are 1, 2, and 3.
stoppedhelpseeking represents whether the participant stopped seeking help/dropped out after contacting that responder; 0 = no and 1 = yes.
gender is the participant's gender; 1 = woman and 2 = man.
likedresponder represents whether the participant liked their interaction with the responder; 0 = no and 1 = yes.
censor represents whether the participant did not report a dropout by responder 3.
Here is the code that I have (from the Allison survival analysis/SAS book):
proc phreg data = helpseeking plots=survival;
class id;
model responder*stoppedhelpseeking(0) = gender likedresponder /ties=efron;
run;
My questions:
-If I'm predicting dropout, should the code be stoppedhelpseeking(0) or stoppedhelpseeking(1)?
-How do I account for the right-censoring? I'm concerned that it's not explicitly reflected in my code.
-Do I need to account for the left-censoring, or should I drop those two people from analyses?
-Any other issues with the code that I should know about?
Response by poster: Whoops- here's the real link to the data.
posted by quiet coyote at 10:55 AM on April 7, 2015
I think survival analysis might be on the wrong track. What you might want is an ordinal multinomial model, under the proportional odds assumption. But I have absolutely no idea how one might go about doing that in SAS.
posted by spaghettification at 1:28 PM on April 7, 2015
Response by poster: Can you say more about why you think that survival analysis would be the wrong approach and a multinomial model would be preferred? I would think that it would have to be a multilevel model given the nested structure of the data. I'm using characteristics of both the participant--level 2--and the responder--level 1--to predict dropout after that responder, making the DV a level 1 variable. I had considered just doing multilevel logistic regression with a binary outcome of dropped out vs. did not drop out, but that loses the ordering. Multinomial would retain the ordering but then the dependent variable would be at level 2 rather than level 1, which I don't think multilevel models can handle iirc.
posted by quiet coyote at 1:53 PM on April 7, 2015
I agree that survival analysis is not appropriate in this situation, as I understand it. Proc phreg models time-to-event data, and that's not really what you have here. It would be helpful to know exactly what hypothesis you are testing with this analysis because that will inform what methods you use.
For example, if you want to predict any dropout vs. no dropout, you could rearrange your dataset to have one row of data per ID (wide dataset) and create a new variable 1=any dropout after any responder and 0=no dropout. Then you could just do proc logistic without anything "fancy" because there won't be any repeated measurements. You could also model dropout after responder 3 (1=dropout after responder 3 and 0=dropout before responder 3 + no dropout; or alternatively 2=dropout after responder 3, 1=dropout before responder 3, 0=no dropout). This would also be proc logistic.
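A minimal sketch of that wide-format approach, assuming the long dataset is named helpseeking as in the question. The names helpseeking_wide, anydropout, and nresponders are made up for illustration. Because likedresponder varies by responder, it would need to be summarized per participant before it could enter this model, so only gender is used here.

```sas
/* Collapse the long file to one row per participant.            */
/* helpseeking_wide, anydropout, nresponders are assumed names.  */
proc sql;
  create table helpseeking_wide as
  select id,
         max(gender)             as gender,      /* constant within id */
         max(stoppedhelpseeking) as anydropout,  /* 1 = dropped out after any responder */
         count(*)                as nresponders  /* number of responders reported */
  from helpseeking
  group by id;
quit;

/* Ordinary logistic regression: one row per participant, no repeated measures. */
proc logistic data=helpseeking_wide;
  class gender (ref='1') / param=ref;
  model anydropout(event='1') = gender;
run;
```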
You could simplify things in a similar manner if you wanted to predict the total number of responders used (of course, you'll need some assumptions about the distribution of the responders variable and its relationship to the independent variables).
If you really want to maintain the long dataset with responders as the unit of time, you could use proc genmod, which will allow you to get GEE estimates for a model with repeated measures. This will allow you to estimate the odds of dropping out over time (here: responders). If I remember correctly, it would be ok that two of your participants are missing data on the first two responders; I think proc genmod just uses the data you give it.
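Something like the following could be a starting point for the GEE approach on the long dataset. The exchangeable working correlation (type=exch) and the inclusion of responder as a categorical predictor are assumptions for illustration, not requirements; other working correlation structures may suit the data better.

```sas
/* GEE logistic model on the person-period (long) data.             */
/* DESCENDING makes the model predict stoppedhelpseeking = 1.       */
proc genmod data=helpseeking descending;
  class id gender responder;
  model stoppedhelpseeking = gender likedresponder responder
        / dist=binomial link=logit;
  /* Rows sharing an id are treated as correlated repeated measures. */
  repeated subject=id / type=exch;   /* type=exch is an assumed choice */
run;
```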
You get the idea. The utility of my suggestions really hinges on the story you want to tell, how many observations you have, and the distribution of your data.
If you haven't stumbled across it already, the UCLA SAS website is an awesome resource.
Regression in SAS is my life, so feel free to send me a message and I can try to help you in more detail.
posted by stripesandplaid at 5:25 AM on April 8, 2015
I really don't know how survival analysis will actually behave with the discrete "times". It's not necessarily that survival analysis needs continuous time. After all, the Cox Proportional Hazards model only uses the orderings and discards all other information (that's what makes it so robust).
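For reference, a rough sketch of how a Cox model could be fit with the responder number as a heavily tied, discrete "time". It assumes one record per participant (the last responder reported), which is not how the question's person-period file is laid out, so hs_sorted and hs_person are hypothetical intermediate datasets. In the syntax responder*stoppedhelpseeking(0), the value in parentheses is the censoring value, so 0 marks participants who were not observed to drop out. TIES=DISCRETE is one of the tie-handling options PROC PHREG offers; whether it is the right choice here is a judgment call.

```sas
/* Keep each participant's final responder row (assumed layout). */
proc sort data=helpseeking out=hs_sorted;
  by id responder;
run;

data hs_person;
  set hs_sorted;
  by id;
  if last.id;   /* last responder reported by this participant */
run;

/* Cox model with responder number as discrete time;            */
/* stoppedhelpseeking = 0 is treated as censored.               */
proc phreg data=hs_person;
  class gender (ref='1') / param=ref;
  model responder*stoppedhelpseeking(0) = gender likedresponder
        / ties=discrete;
run;
```

Note that likedresponder in this layout reflects only the last responder contacted.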
Having a closer look at your description of the data, I agree that the multinomial model would lead to an impasse. How do you fit each participant's data into one row? You have a missing data problem since you don't know what would have happened if they hadn't dropped out earlier.
However, you don't need a multilevel analysis. I would say logistic regression is the right tool, but I wouldn't follow stripesandplaid's suggestion of widening the data, and only asking whether they drop after any responder, or not. That would lead to the same impasse as above (you have between 1 and 3 responders per participant, so what is your predictor?).
OK, so here's what I would do. At each stage, I would ask “is this participant going to make it through this stage?” i.e. I would target stoppedhelpseeking as my Y. The predictors would be:
- the characteristics of the participant (gender)
- the characteristics of the responder at this stage (likedresponder), including the responder ID encoded as binary features if you have them
- the stage (1,2,3) as binary features
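The setup listed above could be sketched roughly as follows on the existing long dataset, with one row per participant per stage. The class statement turns stage (responder) and gender into indicator variables; the choice of reference levels is an assumption, and responder IDs are left out because the question's data does not appear to include them.

```sas
/* Stage-by-stage logistic regression: each row asks whether the   */
/* participant stopped seeking help after this responder.          */
proc logistic data=helpseeking;
  class responder (ref='1') gender (ref='1') / param=ref;
  model stoppedhelpseeking(event='1') = gender likedresponder responder;
run;
```

This treats rows from the same participant as independent, which matches the caveat about not explicitly accounting for within-participant correlation.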
There are a few problems left with that analysis. It assumes (wrongly, but perhaps reasonably) that the responder at stage 1 does not affect the participant's behaviour in stage 2. It also doesn't explicitly account for correlation between the stages for a single participant. Neither of these issues would cause me to lose sleep.
I actually think the right-censoring is not an issue anymore. You're only ever asking the question “will they make it through the next stage?” and not “how far will they make it?”. The left-censoring is more problematic because you have missing responder data for those two participants. Technically, it's not right to just throw that information away, but since it's only 2, I wouldn't worry about it too much.
Lastly, I would consider using Ridge Regression (an L2 penalty on the size of the coefficients), carefully tuned using cross-validation. That tends to improve the quality of predictions in this kind of scenario.
posted by spaghettification at 8:19 PM on April 8, 2015
posted by number9dream at 10:10 AM on April 7, 2015