Statistical analysis package for OS X needed.
March 17, 2006 8:18 AM

SPSS is fine but produces offensively ugly figures. R has nice, clean output but is otherwise inscrutable. Statistica is, alas, Windows-only. Mathematica isn't really it, either. What statistics package should I use for OS X?

I am in the sciences and need to do basic statistical analysis up to and including regressions. Hopefully, I will be doing more and more statistical analysis -- if the gods of grad school align as I hope -- and need a package that is adequately open-ended to include newer techniques like Bayesian and MCMC.

What should I use?
posted by docgonzo to Science & Nature (21 answers total) 4 users marked this as a favorite
 
Have you tried Stata? I think it's fairly common in biostatistics.
posted by milkrate at 8:32 AM on March 17, 2006


Best answer: This is one of those topics where the best advice I can give you might not be what you want to hear, or directly address your question.

What should you use? If you're going to continue on and expect to eventually be doing Bayesian, MCMC, etc, you should suck it up and learn R, because R is extensible and often extended and can do ~anything. It can also be intensely annoying, yes. You might look at R Commander to make the initial learning curve shallower. But I recommend not using GUI interfaces unless you need them as they can encourage bad habits.

I haven't seen Stata for OSX, but Stata has reasonably comprehensive statistical capabilities and decent graphing in its Windows incarnation. R can do things Stata can't, though.

For simple stuff (OLS / probit / logit with no real bells or whistles), gretl has good graphing with a lot of standard diagnostic plots built in. All it really does, though, is generate gnuplot plots.

More broadly:

Sometimes ya gotta use separate graphing/plotting and statistical software, there's no way around it. Either the software won't graph well (and you have to use it because it's the only one that will do the two-stage three-level multinomial logit with chroniton particle correction that you need), or it won't easily generate the required graph. So anyway, you should expect to sometimes have to switch from analysis in one package, to generating and outputting the required information, to graphing in another.

Gnuplot is more or less standard for that, if not really friendly. R can also be used to generate nice plots. Excel does decent graphs.
posted by ROU_Xenophobe at 8:51 AM on March 17, 2006


Best answer: In my mind, the biggest problem with R is its crappy online documentation. Any of the books will be much, much better. My favorite is An Introduction to S and S-Plus by Phil Spector. But there are other good books out there as well. These should reduce the learning curve greatly.
posted by grouse at 8:54 AM on March 17, 2006


Are you sure Mathematica isn't really it? I'm not a stats guru by any stretch, but I am having a hard time imagining what you need that Mathematica can't do - particularly with some of the powerful statistics packages that are available. Someone has already written a package to do whatever it is you want to do. Anyway, if you pin down why Mathematica isn't acceptable it might help others make more informed suggestions.

Wolfram's support for OS X is very good, too, unlike certain other popular cas's beginning with 'M'.

Try this koolaid, too. It's fantastic.
posted by Wolfdog at 9:02 AM on March 17, 2006


To my mind, Mathematica is the best choice, but I'm weird.

Let me explain why: Mathematica is fully OS X native (other than the open source R, I think all of the other stat packages either don't support the Mac, or are stuck at a dead version several revs behind). More importantly, Mathematica is fully programmable. With Mathematica, I'm not locked into any one vendor's idea of what is an important statistical function. If Mathematica does not have it, I can make it have it.

For instance, Mathematica did not have box plots prior to 5.0, so I just wrote my own box plot function, and voila! Mathematica had box plots. Ditto any weird probability mass functions or whatnot. I also find that programming a certain statistical tool really helps me learn and understand that tool, and it can be done quickly in Mathematica.

That said, most people find Mathematica really hard to learn. For whatever reason, I don't, and I've learned it really, really well. It is very extensively documented, and that documentation comes with the package. For any kind of mathematical programming, once you actually learn it, it is a very good fit, and its graphics package is 100% programmable, which can also be a major strength. It's also really expensive.

Other than that, I'd say you are probably going to go with R (which I think is fairly programmable, too). And use gnuplot to make nice graphs. It's worth your time and effort to learn to use a tool.

In the end, I find that climbing the steep learning curve of a command-driven tool like R or Mathematica pays off far more than the "instant" gratification of something like SPSS or (to a lesser extent) Minitab, as those tools are ultimately less flexible. Or to the extent that they are scriptable, it's a kludge and not nearly as natural as I would like.

I've also applied for jobs that used SAS exclusively, but I'm not even sure what it really does, and I don't think it is available for the Mac.
posted by teece at 9:52 AM on March 17, 2006


Are you sure Mathematica isn't really it?

I have not tried to use Mathematica for statistical modeling, but:

Yes. Mathematica is not it. docgonzo wants to input data and get coefficients for a wide variety of models.

Mathematica, even with extensions, seems very ill-suited to that task. Even reviews enthusiastically recommending Mathematica + Mathstatica limit their endorsement to primarily teaching settings, noting that it's not really good at straightforward number-crunching.

Statistical packages are optimized for the task that docgonzo has in mind. Entering a model is more straightforward, and the solvers and optimizers are, well, optimized. I've seen reviews that note that models take 100--300 times longer to converge with Mathematica than with [some standard package].

Mathematica is fully OS X native (other than the open source R, I think all of the other stat packages either don't support the Mac, or are stuck at a dead version several revs behind)

(1) That's not so. Stata and SPSS at least are up to date.

(2) If you need to use another OS to use the right / best software, then you need to use another OS for that purpose. In that case, your reaction should be to learn the rudiments of that OS and use it as needed, not to refrain from using the right / best software because that would mean using an un-preferred OS. docgonzo should expect that (s)he will have to deal with tools that only run under Windows or specific unix workstations from time to time.

With Mathematica, I'm not locked into any one vendor's idea of what is an important statistical function. If Mathematica does not have it, I can make it have it.

R and Stata are likewise very extensible.
posted by ROU_Xenophobe at 10:26 AM on March 17, 2006


I don't know much about stats, but I do know that Prism, by Graphpad, is extremely easy to use and makes reasonably attractive charts with minimal interference from the user, and comes with a book to teach ignorant biologists like me what all those funny statistical terms are supposed to mean.
Perhaps either Prism or their other stats-for-scientists product, InStat, might be what you're looking for?
posted by nowonmai at 10:39 AM on March 17, 2006


Best answer: Excel does decent graphs.

So, ROU Xenophobe, just when did you become best friends with crystal meth?

Excel does horrible graphs, mostly because Microsoft has those awful default behaviours that cannot be changed. Ok, pie graphs are ok in Excel, but I wouldn't go anywhere near a journal with any of their other graphs. Know too that regressions in Excel don't work properly: Microsoft decided to write a math library that was not compliant with IEEE 754. Excel is disrecommended for use by just about every certifying body out there.

For the record, I agree with just about everything else the crack-head says. I favour the suck-it-up-and-learn-R camp (and you should be writing in LaTeX dammit), but using SPSS then doing your graphs in something else (I use Sigmaplot on a PC, but Origin is good too) is equally valid.
posted by bonehead at 10:49 AM on March 17, 2006


I've seen reviews that note that models take 100--300 times longer to converge with Mathematica than with [some standard package]

Most such reviews are very, very suspect. Most people have no idea how to program Mathematica, and as such, make horrible, slow Mathematica programs.

I used to know people in the physics department that thought Mathematica was horribly slow compared to their tools of choice (either Matlab or Fortran). And Mathematica IS slower than those two -- but 90% or more of the pokeyness was the result of their bad programs.

Most folks that write Mathematica programs don't understand the performance hit they take by a) not writing a list/functional program, b) not eliminating purely symbolic elements, and c) not using machine-sized numbers, rather than arbitrary precision numbers.

As such, most Mathematica benchmarks suck.

The end result of actually learning Mathematica (particularly version 5) is that it is fast enough for a lot of uses, all except for the most demanding.
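Point (c) above is not unique to Mathematica. As a rough sketch in Python (chosen only for illustration; nobody in this thread is proposing Python, and the function names are made up), the same sum can be computed with machine-sized floats or with exact rational arithmetic. The answers agree closely, but the exact version does far more work per term, which is the kind of hidden cost that tends to sink a naive benchmark:

```python
from fractions import Fraction
import time


def harmonic_float(n):
    """Sum 1/k for k = 1..n using machine-sized (double precision) floats."""
    return sum(1.0 / k for k in range(1, n + 1))


def harmonic_exact(n):
    """Sum 1/k for k = 1..n using exact rational arithmetic."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


if __name__ == "__main__":
    n = 2000  # hypothetical problem size, picked only to make the gap visible

    start = time.perf_counter()
    approx = harmonic_float(n)
    float_seconds = time.perf_counter() - start

    start = time.perf_counter()
    exact = harmonic_exact(n)
    exact_seconds = time.perf_counter() - start

    # The two results agree to many digits; only the cost differs.
    print("float result:", approx, "in", float_seconds, "seconds")
    print("exact result:", float(exact), "in", exact_seconds, "seconds")
```

The timings will vary by machine, so none are quoted here; the point is only that the arithmetic you ask for, not just the tool, decides how slow the program is.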
posted by teece at 11:08 AM on March 17, 2006


Best answer: Two last comments:

Mathematica, while excellent for discrete mathematics and solving PDE's and the like, is really awkward to use for stats. It's also hugely expensive (if you're not using the crippled student version). Don't use an expensive powertool to drive a nail.

For learning R, I've found books to be my best resource. One that's worked for me is Maindonald and Braun, Data Analysis and Graphics Using R, ISBN 0-521-81336-0, 2003.
posted by bonehead at 11:09 AM on March 17, 2006


(1) That's not so. Stata and SPSS at least are up to date.

Don't know anything about Stata. Sounds cool.

SPSS is at v. 11.0 for Mac; Windows is at v. 14.0. Certain key features of the Windows version are missing on the Mac version.

I've never used SPSS for the Mac, but I get the distinct feel that it is a second class citizen compared to its Windows big brother, and would be very uncomfortable relying on it on that platform. For instance, I'll be pretty surprised if they release an Intel-native version of SPSS for the Mac within the decade. Maybe I'm wrong.

By contrast, R and Mathematica are essentially the same on most platforms on which they run.
posted by teece at 11:16 AM on March 17, 2006


I'm an R fan but I agree that you should probably get a book or check out the tutorials. The embedded manual is not that great. Having done SPSS and R coding, R is a better language, SPSS is a better GUI interface.
posted by KirkJobSluder at 11:47 AM on March 17, 2006


Most folks that write Mathematica programs don't understand the performance hit they take by a) not writing a list/functional program, b) not eliminating purely symbolic elements, and c) not using machine-sized numbers, rather than arbitrary precision numbers.

One might say that having to worry about such things are three more reasons why Mathematica is not appropriate for this task.
posted by grouse at 12:13 PM on March 17, 2006


So, ROU Xenophobe, just when did you become best friends with crystal meth?

I prefer polydichloric euthymol. Or nuke if I can't get that, but it makes my teeth chatter.

Excel does horrible graphs, mostly because Microsoft has those awful default behaviours that cannot be changed.

I haven't had any problems turning off all the crap, removing all the horrible titling and boxing and such, and ending up with a bare figure that LaTeX can title, but I've only used it for very simple data plotting. I would never suggest that anyone do statistical work with Excel. But it's okay (if suboptimal) for plotting a data series that you've generated elsewhere.

Most such reviews are very, very suspect. Most people have no idea how to program Mathematica, and as such, make horrible, slow Mathematica programs.

That's sort of my point. If docgonzo is going to grad school, (s)he should be concentrating on the appropriate subject matter and not on programming Mathematica. Especially not on programming Mathematica to solve problems that there are already canned, accurate, and very fast solutions for, such as finding regression-style coefficients and SEs for a wide variety of dependent variables.
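To make "coefficients and SEs" concrete, here is a minimal sketch of the simplest case, a one-predictor OLS fit, written out by hand in Python. Python is used only for illustration, and the function name simple_ols and the data in the usage note are made up. A stats package gives you all of this, plus diagnostics, from a single command:

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns the two coefficient estimates and their standard errors,
    using the usual residual variance estimate with n - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance, then the textbook standard-error formulas.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))

    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
    }
```

For example, simple_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]) gives a slope of about 1.99 and an intercept of about 0.05. Anything past this toy case (probit, logit, multilevel models) is where the canned, tested implementations really earn their keep.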
posted by ROU_Xenophobe at 12:14 PM on March 17, 2006


One might say that having to worry about such things are three more reasons why Mathematica is not appropriate for this task.

It's entirely appropriate -- it's just harder. I learn a hell of a lot by making things myself in Mathematica, and I don't consider it time wasted AT ALL. (I'm one of those silly purists that thinks if you're going to do statistics, you should do statistics, and learning to make the computer do statistics is part of it. Letting someone else figure out how to make the computer do statistics, and using their work, is not for me).

That's sort of my point.

Right, but it was also my point. Like I said, I'm weird -- I laid out why I like Mathematica. It's a great tool, and I can do anything I want with it. It ain't that hard to learn Mathematica or program it to do very advanced statistical analysis -- I've done it, and I'm neither a super genius nor super motivated.

All I'm trying to say is that programming the analysis yourself is very rewarding, and that Mathematica is one of the (probably two) best tools out there for doing that. You don't have to like that or do it that way, but I wanted to throw it out there.

And I've never been at a disadvantage to someone that was using R or SPSS or whatever tool they choose.
posted by teece at 1:14 PM on March 17, 2006


This is the point in the discussion where I infer that my use of JMP is very, very misguided, right?
posted by dmd at 1:56 PM on March 17, 2006


If all you're going to do is regression, then go with a canned package. But if you are going to do anything more sophisticated, I'm also going to vote for sucking it up and learning something with a real programming language. It seems hard to believe you won't find yourself locked in by someone else's implementation decisions otherwise.

R is free, powerful, and seems to be favored by the statistics community. As such, almost any obscure analysis you can think of is already implemented as a CRAN package.

My weapon of choice is Matlab. Not free. (But someone else pays for it so what do I care?) The Statistics Toolbox doesn't have as much implemented as R, but it meets my needs. I favor Matlab because 1. it's more of a general purpose product than R (lots of signal processing stuff included, for example); 2. I think it's an easier language to write in (possibly R could be as easy if I practiced more), meaning that implementation for me is really fast; 3. it's just what's used around here.

There used to be concerns about Matlab's performance but I think those are largely gone these days, unless you're doing something truly hardcore.

The OS X versions are kept up to date, too. I guess the only thing that bothers me about them is that they use X11 rather than Aqua for their windowing, but that's not a huge deal IMO.
posted by epugachev at 2:18 PM on March 17, 2006


This is the point in the discussion where I infer that my use of JMP is very, very misguided, right?

I wouldn't say that anyone's choice of package is misguided as long as they can get done everything they want to do. Some people dislike canned packages because they believe that their existence facilitates the abuse of statistics, but it seems like it ought to be possible to both understand what you're doing and rely on someone else for your implementation. People have different priorities, and those people whose priorities prevent them from doing a huge amount of programming ought to be able to benefit from the work of others. That's the division of labor at work. I just wouldn't be able to do everything I want to do (or perhaps I could do it, but very inefficiently) if I only had prepackaged tools at my disposal.
posted by epugachev at 2:48 PM on March 17, 2006


This is a hard question, because a lot of people have a lot of different opinions and they all get there eventually.

I do analysis with SPSS or rarely SAS when I can't avoid it; graphing with Excel or Harvard Graphics.

The guy whose mad skills really blew me away used Matlab for everything, but he'd spent most of a 7 year PhD learning to use it. I wasn't able to get a foothold on its steep learning curve.
posted by ikkyu2 at 10:03 PM on March 17, 2006


Also, and I hate to say this, but as a 24-year dyed in the wool Apple enthusiast, I have found that confining yourself to OS X for these purposes handicaps you. It's much easier to do everything on one platform, and there are things you can do with Windows software that can't be done with what's available on the Mac platform.
posted by ikkyu2 at 10:05 PM on March 17, 2006


Response by poster: Wow! Thanks for the amazing feedback on this question. I am sorry that I wasn't able to jump back in yesterday but all of the points or issues that I have have been addressed once or twice.

To respond, I partially came here suspecting the answer would be "suck it up and learn R" but hoping someone would say "no, no, no, not R, use x." Certainly everyone in my department/lab right now (molecular ecology/sequence analysis) thinks R is the best thing if you have to do hardcore stats, but few have learned it as there are canned packages (structure, Arlequin, etc.) that do specific analyses.

Grad school will (hopefully) be epidemiology so I think I'll need the option to do custom, fine-grained analyses. I think I'll pick up one of the R books suggested here -- the built-in help is quite atrocious -- and go to work that way.

Thanks for the help!
posted by docgonzo at 6:08 AM on March 18, 2006

