To R or not to R
August 12, 2013 10:59 AM

My office has recently had some funds open up and we are looking into investing in some statistical software to make our lives easier. We do a lot of work with distribution fitting, Monte Carlo analysis, and regression analysis with data sets that may contain left- or right-censored data. Unfortunately, we only have a few days to identify the best software package for our buck. Alternatively, the idea has been floated to download the free R software and spend the money on some training to get over the steep learning curve. What program or approach would be the best use of our money?
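
For readers less familiar with the kind of work described above, here is a minimal sketch of fitting a distribution to right-censored data and checking the fit with a small Monte Carlo simulation. It is written in plain Python purely for illustration, not in any of the packages discussed in this thread, and it assumes exponentially distributed values with a single fixed censoring cutoff. The function names are hypothetical, and left censoring is not handled.

```python
import random


def fit_exponential_right_censored(times, observed):
    """Maximum-likelihood rate of an exponential distribution.

    times    -- the recorded value for each unit (event time, or the
                censoring time if the event was not seen)
    observed -- True if the event was seen, False if right-censored

    For the exponential model the MLE has a closed form:
    (number of observed events) / (sum of all recorded times).
    """
    events = sum(1 for flag in observed if flag)
    exposure = sum(times)
    if events == 0 or exposure <= 0:
        raise ValueError("need at least one observed event and positive exposure")
    return events / exposure


def simulate_censored_sample(rate, cutoff, n, seed):
    """Draw n exponential values and right-censor them at `cutoff`.

    Hypothetical helper for the Monte Carlo check; a fixed cutoff is an
    assumption made for this sketch.
    """
    rng = random.Random(seed)
    times, observed = [], []
    for _ in range(n):
        t = rng.expovariate(rate)
        observed.append(t <= cutoff)
        times.append(min(t, cutoff))
    return times, observed


if __name__ == "__main__":
    times, observed = simulate_censored_sample(rate=0.5, cutoff=3.0, n=20000, seed=1)
    print("censoring-aware estimate:", fit_exponential_right_censored(times, observed))
    # Treating censored values as if they were real events overstates the rate.
    print("naive estimate:", len(times) / sum(times))
```

Any of the packages discussed here would do this through its own survival or censored-regression routines; the point of the sketch is only that censored observations still contribute exposure time but do not count as events.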
posted by C'est la D.C. to Computers & Internet (23 answers total) 10 users marked this as a favorite
In my opinion, it depends how worthwhile climbing the learning curve would be. For me, the calculus would be how many people need to be able to use the software. Would you be training new people often? Do the people being trained have any background in programming at all?

If it's a lot of people with various backgrounds (i.e., not programmers/statisticians), I'd say SPSS. But R would probably be ideal.
posted by supercres at 11:04 AM on August 12, 2013

Response by poster: The software/training would be for just a few people. We all have strong math/statistics backgrounds with variable but low levels of programming knowledge (Java or MATLAB). It is anticipated that this would be a one-time event and that those trained would pass down the knowledge to new hires.
posted by C'est la D.C. at 11:09 AM on August 12, 2013

A lot of companies are going with Stata lately. I have heard it is more user-friendly (especially for non-programmers) than R, SPSS or SAS, but I have yet to try it myself (a colleague seems to think it is superior to the other three, but I am still an SPSS fangirl myself). I also hear it is more affordable in terms of licensing. May be worth a demo/phone call!
posted by Young Kullervo at 11:18 AM on August 12, 2013

I can't speak to any specific programs but I learned R the same way I learned Stata and SAS - by jumping in with both feet. R isn't the easiest statistical language to learn, so unless you plan on immersing yourself in it during and after training it's probably not the best place to start. On the plus side R is free (unlike SPSS or my beloved Stata), but it has a seriously steep learning curve. We got a quote from Revolution Analytics to do some two-day training for us (three people) and I think the quote they gave us was around $15K.
posted by playertobenamedlater at 11:20 AM on August 12, 2013 [1 favorite]

Stick with the open source choices. It is seriously a pain to constantly have to rework old stats work every time a vendor forces an upgrade with syntax changes down the pipe. With open source versions you never have to worry about the old version going away.

SPSS has proven itself to be a pain in the arse for many researchers with syntax changes upsetting backwards compatibility and version conflicts between researchers at different institutions on different upgrade schedules. Not to mention the licensing issues. You want to do stats. You don't want to spend your time redoing what you have already done or playing system administrator licensing games.

R is the way to go forwards and backwards.
posted by srboisvert at 11:29 AM on August 12, 2013 [2 favorites]

Is focusing on R going to hurt your office somehow? You can reproducibly share statistical information quite easily once you learn the language.
Here are two interesting examples of this communication in action.
posted by oceanjesse at 11:31 AM on August 12, 2013

If you have minimal programming knowledge R shouldn't be too hard to pick up. It uses basic logic for most of its stuff. It's also a very good tool for most things. Its data browser isn't great, at least by default, so other programs have an advantage over it there, but it's incredibly flexible. If there are only a few of you then it's perfect: as stats get spread across an office, easier programs become more useful.

I'm not convinced Stata is that much easier to use than R, and it isn't free.
posted by Cannon Fodder at 11:33 AM on August 12, 2013

R is pretty good. Personally, I'd use RStudio.
posted by dfriedman at 11:36 AM on August 12, 2013

I've used Stata, SPSS, and SAS, and also dabbled with R, and I would say SAS has the steepest learning curve of the bunch (and is also the most expensive to buy).

Personally I feel that Stata hits the sweet spot of user friendliness and power. I think it's easily learned by people who already have math/statistics backgrounds, and I have found the quality of texts and tutorials available to be excellent.

Would there be time for people in your office to download R and get an evaluation copy of Stata and play around with both a bit?

(if you decide to go with Stata, please also allocate funds to purchase The Workflow of Data Analysis Using Stata, which is about how to get your work done using Stata).
posted by needled at 11:37 AM on August 12, 2013

I feel very comfortable using SPSS, but the regular licensing fees can be annoying and their customer service isn't great. I've been trying to move toward more use of R. There are programs like R Commander that provide a user interface that's more like the point-and-click of SPSS, so that might be something to consider. As far as training, I did a 6-week online course through Virginia Commonwealth that was around $300. It was very helpful in just showing me how to find things and do the basics.
posted by bizzyb at 11:39 AM on August 12, 2013

If you do end up going with R, you might find this to be useful:

As using regular Google to try and search for R related things is painful at best.
posted by foxfirefey at 11:41 AM on August 12, 2013 [2 favorites]

I recommend Stata. I use both R and Stata regularly, and I think they both have their plusses and minuses. The big minus with R is that it has a very steep learning curve for folks like you guys (and me) without a strong programming background. I also find that if I'm away from using it for a while, it takes me a good chunk of time to get back in the swing of remembering how to do even basic things. I think it is difficult enough to learn that it wouldn't be easy to just 'pass down' to new employees - I think you would really need to recruit for people who already know it or have a dedicated training program type thing.

With Stata, there is basically no training needed - I regularly teach it to undergrads with only basic stats knowledge in a 60 minute workshop, and I myself learned it by reading through a manual and just playing around with the program.

The other thing about Stata is that it is 1000x easier to clean data and put together datasets than R. I know no one who is willing to spend the time and energy to clean data in R -- I did it once in a class that required it, and never again. Even really dedicated R users who swear by it and turn their noses up at Stata still use something other than R for data cleaning. So, if you regularly work with data that you have to piece together/alter/clean before analyzing, Stata is the way to go. R is more powerful ONCE that data is cleaned, but it sounds like you're not doing anything hugely complicated with your data anyway.
posted by rainbowbrite at 12:21 PM on August 12, 2013 [2 favorites]

I think the right answer here is that you should use whatever package seems most prominent among the bright young things in your intellectual community. This will give you access to whatever packages they end up creating that support whatever the weirdo models for your field are. And it will give you the best informal software support from within your own community and with reference to the kinds of models that you typically run.

If there isn't any clear winner among your Bright Young Thing set, then I think that Stata is easier to live with and get ordinary work done in than R is. That is, you'll probably be spending most of your time running the vanilla versions of canned models, and the workflow for that is simpler in Stata than R. R's object orientation is great if you're doing something complicated and chaining some of the output from this thing into an input over here and so on, but it's annoying when it won't show you regression output after running the regression unless you specifically poke the output object.

I will say that R isn't particularly hard, at least not at the level of running canned models. Its learning curve, at that level, isn't really any steeper than other command-line oriented packages like Stata.* There are just more command-line things you have to do in order to do stuff. More at the level of "consistent but minor pain in the ass" than actual difficulty, if you see what I mean. I wouldn't recommend springing for training right out of the gate, except maybe for people who would need to write new packages, etc, within R.

*There are point and click interfaces within Stata. What I tell students in my intro stats courses is that they are temptations sent by Satan, and that the one true and holy path is to write .do files that input virgin datasets and do whatever you want to do.
posted by ROU_Xenophobe at 12:31 PM on August 12, 2013 [2 favorites]

I have used R, SAS, and Stata, and if you can program in Java or Matlab, you can definitely learn to use R. The previously mentioned RStudio even looks like Matlab. I think R has a couple main advantages:

1) Free and you don't need to worry about limited licenses. A team of people sharing 1 or 2 licenses gets annoying fast. If you have to change your data storage policy, you'll always be able to get R on the same machine that the data is on.
2) Online resources are the most supportive. Check out Coursera, Codecademy, etc. There are many free courses in R and a ton of great forums. Support for SAS and Stata is more formal -- more published texts, fewer interactive tutorials.
3) Packages. For what you're describing, you can choose from an ever-growing number of R packages that probably already do exactly what you want. With Stata, you're more confined to what's included with the software. There are user-created modules, but these are generally less robust than what is out there for R. This is because there's a growing community of university statisticians who create really thorough libraries for R.

I think the last thing to consider is the industry you are in, and the goals of the group that is using it. What skills would be the best to learn for the next step in your careers? Stata is a great skill to learn if you're going to be working mainly with medical data or economic models. But in business analytics, I think R is much more popular. Pick the language that is going to be the easiest to find in new applicants, and the most useful to senior team members.

(On preview, ROU_Xenophobe says the last point better.)
posted by tinymegalo at 12:33 PM on August 12, 2013 [2 favorites]

"Pick the language that is going to be the easiest to find in new applicants"

I want to take this home and hug it.
posted by ROU_Xenophobe at 1:39 PM on August 12, 2013 [2 favorites]

What are you using now? Why do you want to switch to something else? Without knowing that it's tricky to answer your question, but I'll make a case for R anyways.

R was more or less made for people with strong math/stat backgrounds and low levels of programming ability. A lot of the initial learning curve is the kind of thing that goes with any programming language, so if you're even a little familiar with MATLAB or Java, things should be easier.

The other argument for using R is that if you need to do something that's even a little nonstandard, R is more likely to let you change the default values or have a package that does.
posted by matildatakesovertheworld at 2:10 PM on August 12, 2013 [1 favorite]

I will add that I think R is on the upswing still, since its open-source nature means that it can expand and adapt much more quickly than the commercial packages.
posted by zscore at 2:12 PM on August 12, 2013

"if you can program in Java or Matlab, you can definitely learn to use R."

Exactly what I came here to say.
posted by mikeand1 at 2:35 PM on August 12, 2013

Also, props to these guys: matildatakesovertheworld, ROU_Xenophobe, tinymegalo.
posted by zscore at 2:38 PM on August 12, 2013

I wouldn't spend the money on proper training -- I'd send it to some Rstats guru, have them develop some well-commented scripts that deal with your data (reading it in, doing regressions, reporting results, maybe some visualizations) and then study those. If you don't know who to ask, I'm sure Stack Overflow or the #r channel on freenode could help with that. The amount of data you need to process is pretty important; if it's in terabytes you might want to contact Revolution Analytics or some other big data guys.
posted by 3mendo at 2:44 PM on August 12, 2013

"I will add that I think R is on the upswing still, since its open-source nature means that it can expand and adapt much more quickly than the commercial packages."

Stata is very far from open source, but its language is open enough that there's a vast user-created library of .do and .ado files out there. Not as big as R's, almost certainly, but it would be wrong to think that it being closed-source means that you're stuck with what StataCorp gives you.

For that matter, most packages/commands from StataCorp are written as normal .do / .ado files, so a lot of it actually is almost-open-source, in the very limited sense that you can see the code for the various commands.
posted by ROU_Xenophobe at 3:01 PM on August 12, 2013

I'm a beginner to R, but I use this GUI called Deducer that makes it a whole lot easier.
posted by dhruva at 5:42 PM on August 12, 2013

Another plug for Stata. The GUI is simple and relatively intuitive so the learning curve isn't steep (honestly! I've seen undergrads with zero programming experience and a simple grasp of statistical methods learn it in a couple weeks tops). And as mentioned, there are a lot of .do and .ado files around for the power users.

For training materials, I like A Gentle Introduction to Stata, by Acock.
posted by epanalepsis at 5:47 AM on August 13, 2013
