Please make statistics stop hurting my head!
April 6, 2009 11:47 AM

Confused by statistics and regression analysis.

So stats has always been a weak point for me, and I've been struggling with a concept for a while now. I have access to SAS and Excel, and I would be willing to use R if someone could walk me through it; I've just never used it before.

I think a big part of my problem is that I can type things into stats packages and get numbers and significance values out, but the output doesn't make sense to me, which I think points to a fundamental lack of understanding of regression analysis.

I've got this data. I've got 1 condition, "D" that has two states- high and low, and another condition, "T" that has 4- 0,1,3,5. What the data look like graphically is that for high "D", it increases from 0-1 or 0-3 and then plateaus off, but for low "D", it increases linearly.

I want to statistically prove that those two lines are different from each other. One problem is a high degree of subject variability: when I do an anova with subject as a variable, I always get a strong main effect of subject (which is actually expected with the type of data I've got).

To make things more complicated, this is actually a combo of behavioral and fMRI data. What I'd really like to do is use the behavioral data (which behaves as above) as a sort of regressor to show which areas are activated in the same way as the behavioral data, but this might be beyond Ask MeFi's range, as I think I'd need someone experienced with Freesurfer for that one.


So after all that, can anyone actually help me figure out what I'm supposed to be running and how? I've tried anovas and SAS's glm, but I don't really understand what is going on or what to input to get what I want out.
posted by katers890 to Science & Nature (23 answers total) 4 users marked this as a favorite
 
I'm confused by what you mean by "it increases from 0-1 or 0-3." Is there an element of time to this as well?
posted by bsdfish at 11:49 AM on April 6, 2009


Or are you talking about the distribution (histogram) of T?
posted by bsdfish at 11:49 AM on April 6, 2009


Response by poster: so let's say performance increases from T=0 to T=1 and then levels off, or increases from T=0 to T=1 to T=3 and then levels off. Basically in the high case, performance increases and then plateaus, while in the low case, performance keeps increasing linearly as T increases.

T isn't time, it's number of targets. If that helps.

Sorry if I'm not making much sense, there's a lot going on in this data that I'm struggling with.
posted by katers890 at 11:53 AM on April 6, 2009


If you are interested in proving that the distribution of T when D=0 is different from the distribution of T when D=1, you may not necessarily need to fit something like a linear model. Rather, you want a test of the null hypothesis that P(T | D = 0) equals P(T | D = 1).

In this case, you may want to look at Fisher's exact test (or the chi-squared test) or the Kolmogorov-Smirnov test. All of these assume that the (D,T) observations are independent of each other, and will tell you whether there is statistical evidence that the distribution of T differs across values of D.

- Fisher's test will treat each bin of T as a factor, not an ordinal (there is no implication that 1 is 'closer' to 0 than to 5)
- Chi-squared will work under the same setup as Fisher's, but may be easier to compute
- KS will basically test if the shapes of the cumulative distribution of T are the same or different, so it will treat the values of T as ordinals: a slight shift of mass from 0 to 1 will be somewhat different than a shift from 0 to 5. (note: sort-of).
posted by bsdfish at 11:59 AM on April 6, 2009


What field are you in? There are many books that really break down statistics to what it MEANS as applied to that field. It gets back to basics, but I found it made a world of difference in working with the stats. It helped me understand what the tests were comparing, rather than just looking at a readout on the screen and grabbing some meaningless number. I have a good one for biology that I can recommend. I can't remember the title right now, but if you're interested, message me and I'll look it up.
posted by CTORourke at 12:00 PM on April 6, 2009


Whoa, ignore everything I wrote ... you also have a performance observation that you didn't mention in your original post!

In this case, you have a somewhat harder thing to test: whether performance given T and D depends on the value of D. You will need more assumptions here and each set of assumptions will lead to a different model and test.

- Is it reasonable to say that for any fixed assignment of T and D, the performance is normally (Gaussian) distributed?
- Do you wish to assume that the number of targets affects performance linearly? I.e., moving from 1 to 3 targets increases performance by the same amount as moving from 3 to 5.
  - If the answer is yes, your hypothesis test will have a reasonable amount of power: your p-values are more likely to come out significant if the distributions are actually different. However, if the assumption turns out to be false, the numbers you get back are garbage.
  - If the answer is no, you will have to do a more general test, and it will have less power.

Honestly, I think you need to read a book on statistics as related to your field, as CTORourke suggested. Alternatively, ask around in the stats department if someone can help you out ... if you get a pointer, invite him/her to coffee and explain the problem in person, bring the data, etc. I don't think you have a problem that's very hard to solve, but the problem is that there are many approaches that *could* work if the assumptions are met, so a more in-depth consultation is required to figure out if that's the case.
posted by bsdfish at 12:11 PM on April 6, 2009


So you have three variables, D, T, and performance. I assume performance is your dependent variable. I assume D=1 if it's high and D=0 if it's low.

Things vary strongly across regression-oriented and anova-oriented worlds.

In a regression-oriented world, you'd just need to create an interaction term DT by multiplying D and T together. Then do a regression like this (in Stata format: regress DV IV1 IV2 etc.)

regress performance T D DT

the coefficient reported for T is the effect of a unit increase in T when D=0

the coefficient reported for D is the effect of changing from D=0 to D=1 when T=0

the coefficient for DT is how much the effect of T changes when you shift D from 0 to 1 (roughly)

If the coefficient on DT is significant, your lines are discernibly different. roughly.
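To make the coefficient interpretation above concrete, here is a minimal Python sketch. The numbers are made up and noise-free (they are not the poster's data), and `fit_line` is a hypothetical helper written for this illustration. With a 0/1 D and a D*T interaction, the regression fits the same two lines you would get by fitting each D group separately, so the coefficients can be read off those two lines:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up, noise-free illustration data (not from the thread).
T = [0, 1, 3, 5]
perf_low = [1.0 + 0.5 * t for t in T]    # D = 0
perf_high = [1.2 + 0.1 * t for t in T]   # D = 1

int_low, slope_low = fit_line(T, perf_low)
int_high, slope_high = fit_line(T, perf_high)

# Coefficients of: performance = b0 + b_T*T + b_D*D + b_DT*(D*T)
b_T = slope_low                 # effect of one more target when D = 0
b_D = int_high - int_low        # effect of D going 0 -> 1 when T = 0
b_DT = slope_high - slope_low   # change in the T slope when D goes 0 -> 1

print(b_T, b_D, b_DT)
```

This only shows what the point estimates mean. Whether the DT coefficient is significant also depends on its standard error, which the sketch does not compute.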

In practice, you'd probably use an ordered logit or ordered probit instead of OLS. Same deal, but interpretation is trickier because of the nonlinearities.
posted by ROU_Xenophobe at 1:00 PM on April 6, 2009


Are you doing this as part of some university research?

If so, find out if your department has a dedicated statistician, or a relationship with anyone in the statistics department. It's good to have a basic understanding of statistics, but let the experts run the data and tell you what it really means.
posted by bengarland at 1:18 PM on April 6, 2009


Response by poster: So yes, this is part of my Ph.D. research in a cog psych/neuroscience field. I've sat through at least 4 stats classes in my life (both undergrad and in grad school), but each seems to teach at this extraordinarily basic level that it is way too simplistic and makes me almost immediately tune out. Because of this, all the basic stats I've got, but when it gets into more complicated areas, I can't quite follow what I'm supposed to be doing and what it all means. This is sort of the crux of my problem, so as far as books go I'd need something that wasn't too complicated to understand but handled more complex types of data.

And I have to do this myself. Hunting down a statistician would be nice, but it's not really an option.


ROU_Xenophobe: So basically I've run a GLM in SAS with performance = T D T*D, but I'm not sure what the coefficients would be. I've got significance values out (T is always sig, as is D, but the interaction is teetering at p~0.1). Does that say that as you shift from D=0 to D=1, the effect of T is almost, but not quite, different?

Or am I having a problem with the fact that for D=0 the data is linear, but for D=1 it isn't (increases as T goes from 0->1 items and then plateaus)? What is this ordered logit or ordered probit concept?
posted by katers890 at 2:07 PM on April 6, 2009


I want to statistically prove that those two lines are different from each other.

It's a bit of a different approach from regression, but it might help if you think of your T levels as factors instead of numbers, i.e. map them to "letters":

T { 0, 1, 3, 5 } -> { A, B, C, D }

Here's one possible sample, based on how you described the problem:

Low: AAAABBBBCCCCDDDD

For a low condition, you progress from A to D at a linear rate.

Here's another example:

High: AAABBBBBBBBBBBBB

In a high condition, you plateau to B quickly.

In a population of "words" like these ("universe" of samples), you might think of measuring the similarity of said words by scoring substitutions.

As you go along the word, you might think it would be more "expensive" for a D in a low condition to switch to a B in the high condition. But to switch from an A to a B wouldn't "cost" as much. The lower the total cost, you might think two strings are more similar, and vice versa.

Once you have a scoring matrix, ungap-BLAST the two strings to measure the statistical similarity.
posted by Blazecock Pileon at 2:13 PM on April 6, 2009


Response by poster: I guess a side, related question would be (and honestly I'm not a complete idiot, I've done a fair amount of stats along the way, but something about this is breaking my brain):

So if I have my indep. variables D (high and low) and T (0,1,3,5), and I have a dependent variable of performance, and another dependent variable of brain activity at each point, how would I go about showing that performance across T for each given D is significantly (or not) correlated with activity? Or possibly that a linear model correlates with activity for T at low D, but a model that increases and then plateaus fits T at high D?
posted by katers890 at 2:14 PM on April 6, 2009


It actually sounds as if you have five variables: D, T, performance, brain activity, and subject. Is "performance" a discrete or continuous variable? What about "brain activity"? What are their ranges? And do you have the same set of observations for each test subject? (If so, you should probably be considering panel-type models). I am guessing from your description that "performance" is a continuous variable, but I cannot tell whether you have multiple observations for a given set of (D, T, subject). Can you clarify your data structure?

Generally, regression analysis just "fits" a line or a curve to the data using some "best fit" criterion. That is, it tells you that if the model family you selected were actually generating the data you observed, then the particular model from that family that "best fits" the data is the one with the coefficients that the regression program spits out. The goodness-of-fit statistics tell you something about how big the residuals are, and significance stats on the individual coefficients (or sets of them) tell you how likely you'd be to observe data like yours if the coefficient(s) were actually zero (or negative, or something else, depending on the stat in question), and all are usually heavily dependent on a lot of underlying assumptions. And a lot hangs on that "if" up there - there is often a real art to figuring out which model families to consider in a particular situation (and also what "best fit" criterion to choose).

Beyond the data itself, do you have other reasons to expect that the data would follow some particular pattern? You seem to be testing whether the conditional model where D=0 has a different conditional distribution from the model restricted to D=1. Why? Do you have any reason to expect any particular distribution in either case? Is there any independent reason to think a priori that either would be linear?

Are all of your subjects tested at all values of T, or is the selection of which subject is given which T condition at least independent (or is it possible that the high-performance subjects are more likely to see condition T=5, for instance)? Are there likely to be other drivers of performance beyond D and T that you are not capturing, and do you know anything about the distribution of those factors? All of these considerations (and more) may influence what models you should consider.

Regression analysis is not easy to pick up on your own, particularly from SAS documentation (which always appears to me to be heavily geared toward obscure populational biostats applications). I can't speak to books focused on your area of study, but for models commonly used in econometric applications, Peter Kennedy's Guide is about the best resource for learning the stuff quickly. Good luck!
posted by dilettanti at 2:39 PM on April 6, 2009


Response by poster: So dilettanti, you are technically right, there are 5 variables (if you don't include different hemispheres, etc).

So basically here's what I've got. Say 11 subjects, for each of whom I have performance and brain activity data for T=0,1,3,5 at D=low and D=high (so that's 8 conditions for each subject). Previous research has shown that performance and activity are tied together, and that for D=low, as T increases, both performance and activity increase. D=high is sort of the new area we are making suggestions on. Based on a previous behavioral study we've done, when D=high, performance does increase as T increases, but it doesn't keep increasing; it plateaus (performance and brain activity both being continuous variables). We are trying to see if the brain activity does the same thing. When you look at the graph, it does, but we need to pull out the significance that says it does, and that's where I am failing.

One big issue is that performance data is highly variable for subjects in this methodology (some are great, some are not), and brain activity levels are often highly variable across subjects. So while I don't actually care to look at the effect of subjects, I might have to take it into account.
posted by katers890 at 3:08 PM on April 6, 2009


but i'm not sure what the coefficients would be

Googling, SAS calls them "parameter estimates." If your output looks like this, you're probably using GLM to run an ANOVA. Regression-world output looks like this.

The regression-oriented world and the anova-oriented world seem to communicate only occasionally. I suspect that you're in the anova world, especially since you're doing bio-ish stuff, and you should talk to other people in the anova world about what to do, and not try to beat a regression-oriented approach into that world. So you should ignore me. I'll keep answering some bits for the sake of completeness, but my real advice is to listen to the people talking from the anova-oriented world.

I've got significance values out (T is always sig, as is D, but the interaction is teetering at p~0.1). Does that say that as you shift from D=0 to D=1, the effect of T is almost, but not quite, different?

It means that you can reject the hypothesis that the effect of T is the same at D=0 and D=1 at the 10% level, but not at the usual 5% level. Put another way: fit a best-fit line for the observations where D=0, and another where D=1. If the population best-fit lines really had the same slope, you'd see a difference in slopes at least this large about 10% of the time.

Or am I having a problem with the fact that for D=0 the data is linear, but for D=1 it isn't (increases as T goes from 0->1 items and then plateaus)?

OLS doesn't care. OLS just fits the best-fit line, even if the relationship is nonlinear.
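A small Python sketch of that point, using made-up numbers that jump and then plateau (not the poster's data; `fit_line` is a hypothetical helper). OLS still returns a positive slope, and the nonlinearity only shows up as a pattern in the residuals:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up plateau-shaped data: rises from T=0 to T=1, then stays flat.
T = [0, 1, 3, 5]
perf = [0.0, 2.0, 2.0, 2.0]

intercept, slope = fit_line(T, perf)
residuals = [y - (intercept + slope * t) for t, y in zip(T, perf)]

print("slope:", slope)
print("residuals:", residuals)
```

The residuals come out negative at the ends and positive in the middle, which is the usual sign that a straight line is the wrong shape for the data.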

What is this ordered logit or ordered probit concept?

Ignore it. I meant to delete that but didn't; it would be relevant if T were your DV.

And yes, if you have multiple observations per subject, you would be doing a regression-oriented model with explicit panel models or with a hierarchical model.
posted by ROU_Xenophobe at 3:13 PM on April 6, 2009


katers890: So dilettanti, you are technically right, there are 5 variables (if you don't include different hemispheres, etc).

Not just "technically" - it sounds as if the entire goal of your study is to study an essentially within-subject phenomenon across T. I would think accounting for the relationship among observations for a single subject could be rather important to consider in your model. Consider "panel" models (particularly the "fixed effects" models) if you're hanging out in the regression world, though as ROU_Xenophobe suggests, you may be better off focusing on ANOVA-type analysis, since your explanatory variables are categorical. I know very little about ANOVA-based analysis (he's right, the two rarely talk, and I learned in the regression world), so I have no idea where to point you for info on how to account for the panel structure in ANOVA analysis. My impression has been that the more common linear regression models essentially become the corresponding ANOVA models when used on categorical explanatory data, but I don't know how well that holds up for panel models. How was the analysis done in the previous research?
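As a rough illustration of what a fixed effects model does with subject variability, here is a Python sketch of the "within" step: subtracting each subject's own mean. The two subjects and their numbers are invented for this example, with the same T effect but very different baselines, and `demean` is a hypothetical helper:

```python
from statistics import mean

T = [0, 1, 3, 5]

# Made-up data: both subjects gain 0.5 per target, but S2 starts much higher.
performance = {
    "S1": [1.0, 1.5, 2.5, 3.5],
    "S2": [3.0, 3.5, 4.5, 5.5],
}

def demean(values):
    """Subtract the subject's own mean from each of their observations."""
    m = mean(values)
    return [v - m for v in values]

centered = {subject: demean(values) for subject, values in performance.items()}

for subject, values in centered.items():
    print(subject, values)
```

After centering, the two subjects' profiles coincide, so the baseline difference between subjects no longer competes with the effect of T. A real fixed effects estimator does more than this (for instance, it adjusts the degrees of freedom), so treat the sketch as the idea only.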

It sounds as if brain activity is another dependent variable like performance, so you can just substitute it for performance in whatever analysis of performance you do. So ignoring brain activity, do you have one and only one observation of performance for each subject for each combination of (D, T)? Does your data look something like the following?

+---------+---+-------------------+-------------------+-------------------+-------------------+
| Subject | D | performance (T=0) | performance (T=1) | performance (T=3) | performance (T=5) |
+---------+---+-------------------+-------------------+-------------------+-------------------+
|    A    | 0 |            0.2489 |            1.4823 |            2.3234 |            4.0102 |
+---------+---+-------------------+-------------------+-------------------+-------------------+
|    A    | 1 |            0.2543 |            1.9827 |            2.0232 |            2.0324 |
+---------+---+-------------------+-------------------+-------------------+-------------------+
|    B    | 0 |            0.9734 |            1.6448 |            2.8466 |            3.6784 |
+---------+---+-------------------+-------------------+-------------------+-------------------+
|    B    | 1 |            0.2165 |            1.0684 |            1.6484 |            1.5468 |
+---------+---+-------------------+-------------------+-------------------+-------------------+
|    C    | 0 | ...

If so, I'd make that structure clear to anyone you ask for help.
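Regression routines generally want this kind of data in "long" format, with one row per subject, D, and T combination, rather than one column per T level. A minimal Python sketch of the reshaping, using the example values for subjects A and B from the table above (the variable names are just for illustration):

```python
# Values copied from the example table above (subjects A and B only).
wide = [
    ("A", 0, [0.2489, 1.4823, 2.3234, 4.0102]),
    ("A", 1, [0.2543, 1.9827, 2.0232, 2.0324]),
    ("B", 0, [0.9734, 1.6448, 2.8466, 3.6784]),
    ("B", 1, [0.2165, 1.0684, 1.6484, 1.5468]),
]
T_LEVELS = [0, 1, 3, 5]

# One record per (subject, D, T) observation.
long_rows = [
    {"subject": subject, "D": d, "T": t, "performance": value}
    for subject, d, values in wide
    for t, value in zip(T_LEVELS, values)
]

for row in long_rows:
    print(row)
```

Each wide row becomes four long rows, so two subjects with two D levels give 16 observations.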

BTW, logit and probit are models for when you have categorical dependent variables, which you don't - so I wouldn't waste your time looking into them.
posted by dilettanti at 4:49 PM on April 6, 2009


Crap - the fixed-width font styling worked in preview. That data layout will make more sense viewed in a fixed-width font...
posted by dilettanti at 4:50 PM on April 6, 2009


Just a followup, because I'm curious:

Why do you have to figure out the statistics yourself? This baffles me. Does your degree program seriously expect you to master your neuroscience topic AND become an expert in statistics? I'm just asking. At my school we have statistics people for that... and when I've consulted with them, I'm always blown away by all of the different suggestions and iterations of analysis that they come up with (and I have a pretty good understanding of the basic/intermediate stat stuff). I couldn't imagine trying to do my research statistics on my own... I'd make too many mistakes, and overlook too many things. You can't be a master of everything.
posted by bengarland at 5:00 PM on April 6, 2009


Response by poster: ROU_Xenophobe and dilettanti, I have tried it in ANOVA land, that is where I normally live and where it does make more sense for me, but my advisor suggested that for what we were doing here we may need to go into regression land, hence my question.

I agree I wasn't particularly clear in my initial question, I was trying to provide enough information without overloading anyone (plus wicked head cold is making thoughts slightly less coherent). But yes, dilettanti, your table is exactly what I have in terms of data. It looks like panel models are what I need to look into.

As far as why I have to do statistics myself and the lovely world of my grad program... well, you can check my history on that one to get my opinion on my grad school. Theoretically my advisor (who has a very strong math/stats background) should be able to help me, but you can get a gist on why that doesn't work if you read my past question history.
posted by katers890 at 5:18 PM on April 6, 2009


If you want to do it in regression land, the right thing would probably be some sort of multilevel model or linear hierarchical model. Responses clustered in subjects, or maybe responses clustered in D levels clustered in subjects.
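One way to write down the simplest version of such a model, with responses clustered in subjects via a random intercept, is sketched below. The notation is an assumption made for illustration: i indexes subjects, j indexes the observations within a subject, and y is performance (or brain activity).

```latex
y_{ij} = \beta_0 + \beta_1 T_{ij} + \beta_2 D_{ij} + \beta_3 \, (D_{ij} \times T_{ij}) + u_i + \varepsilon_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
```

Here u_i is a subject-specific shift that soaks up the "some subjects are great, some are not" variability, and the coefficient on the D-by-T term plays the same role as the DT coefficient in the plain regression. Letting the slopes vary by subject as well would be a further extension.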

On the other hand, that's for when you go to publish. For initial analyses that only you and trusted people are going to look at, plain vanilla OLS will almost always suffice. Seems like you've done this with the GLM; it's telling you so far that you're on the right track but it's a bit dicey.

You can do regression or regression-style analyses with an IV like "number of targets."
posted by ROU_Xenophobe at 5:46 PM on April 6, 2009


Response by poster: ROU_Xenophobe, can you point me somewhere that might show me how to run something like that and how to interpret the output? I've seen the concepts of multilevel models and whatnot before, but I'm not sure how to do them.

We are hoping to be close to publishing land, hence trying to do it more rigorously than my poor understanding seems to afford me. Plus the p ~ 0.1 is bugging the crap out of us, because it's not significant, but it's not really not significant : )
posted by katers890 at 6:04 PM on April 6, 2009


BTW, the non-linearity in the D=1 case is somewhat troubling. If that feature is important to capture in your analysis (and it sounds like it is), my inclination is to suggest estimating a linear spline with one knot (a spline is a piecewise polynomial - think two lines or curves that are joined at a point, so the slope of the spline can change at the knot). Unfortunately, I have reached the limits of my knowledge, so I can't be much help with that. In SAS, PROC TRANSREG allows spline estimation, but I don't know how splines can be fit within a multilevel model. There is a PROC PANEL in the SAS/ETS module, but I don't know if it has a spline option - I doubt it, even though SAS likes to make all of its stats procedures mostly redundant with most of the other procedures. Of course, the more ways you look at the data, and the more estimates you can report, the better, right? So maybe segment the data on D, run OLS separately for each value of D and compare results, run a one-knot linear spline regression separately on each D and compare, run a fixed effects model on the full panel, etc. Then just report them all and discuss.
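For what a one-knot linear spline looks like in practice, here is a small Python sketch. It is not SAS and not a multilevel model. The knot location, the data, and the helper names `solve` and `ols` are all assumptions for illustration. The trick is to add a "hinge" regressor max(0, T - knot), whose coefficient is the change in slope after the knot:

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, ys):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

KNOT = 1                      # assumed knot location
T = [0, 1, 3, 5]
perf = [0.0, 2.0, 2.0, 2.0]   # made-up plateau-shaped data

# Columns: intercept, T, hinge term max(0, T - KNOT).
design = [[1.0, t, max(0, t - KNOT)] for t in T]
b0, slope_before, slope_change = ols(design, perf)
slope_after = slope_before + slope_change

print(slope_before, slope_after)
```

With only four T levels the knot can realistically sit only at T=1 or T=3, and a single subject leaves almost nothing over for estimating error, so in real use this would be fit across all subjects.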
posted by dilettanti at 6:47 PM on April 6, 2009


ROU_Xenophobe, can you point me somewhere that might show me how to run something like that and how to interpret the output?

Marco Steenbergen and Brad Jones have an explanatory article in American Journal of Political Science (or American Political Science Review or Journal of Politics), and Brad has (or used to have) a series of lecture slideshows and notes available online. I'm sure there are people doing it with bio or psych, but I don't do that.

Honestly, if you're unfamiliar with the regression-oriented world I'd start with vanilla OLS, move to vanilla OLS with clustered SEs, and move on from there. Likewise, I wouldn't bother with a spline or other ways to induce nonlinearities (logs, squared or other polynomial terms) until you're comfortable with vanilla OLS. Trying to eat everything through HLMs with nonlinear effects all at once makes for a big, big bite.

You can also, as dilettanti implies, run separate models for D=0 and D=1 and then do hypothesis tests between the coefficients on T in them both by extracting the variance-covariance matrix from each run.
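A rough Python sketch of that comparison, using the illustrative numbers for subjects A and B from dilettanti's table earlier in the thread. It fits a separate line for D=0 and D=1, takes the standard error of each T slope (the square root of the slope's entry in the variance-covariance matrix), and forms a z-like statistic for the difference. `slope_and_se` is a hypothetical helper, and the calculation assumes the two fits are independent and ignores the repeated measures on each subject, so it is a first look rather than a publishable test:

```python
from math import sqrt

def slope_and_se(xs, ys):
    """OLS slope of y on x and its standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = sqrt(sse / (n - 2) / sxx)   # residual variance divided by Sxx
    return slope, se

# Subjects A and B from the example table, pooled within each D level.
T = [0, 1, 3, 5, 0, 1, 3, 5]
perf_d0 = [0.2489, 1.4823, 2.3234, 4.0102, 0.9734, 1.6448, 2.8466, 3.6784]
perf_d1 = [0.2543, 1.9827, 2.0232, 2.0324, 0.2165, 1.0684, 1.6484, 1.5468]

slope_d0, se_d0 = slope_and_se(T, perf_d0)
slope_d1, se_d1 = slope_and_se(T, perf_d1)

# Difference in slopes relative to its (assumed independent) standard error.
z = (slope_d1 - slope_d0) / sqrt(se_d0 ** 2 + se_d1 ** 2)

print(slope_d0, slope_d1, z)
```

A z far from zero suggests the slopes differ. With samples this small, a t reference distribution, or better the multilevel approach discussed above, would be the more careful choice.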
posted by ROU_Xenophobe at 8:18 PM on April 6, 2009


Response by poster: Thanks guys!
posted by katers890 at 6:37 AM on April 7, 2009

