What scientific software do I use?
March 23, 2008 10:58 PM

GeekyScientistFilter: Help me find the right software for my research project.

As summer nears, I'm starting to get organized for my senior research project (bio and chem theses). I've used Word and Excel for most of my labs, and occasionally LaTeX, but do professional biologists use MS Office? With that in mind, I'm looking for software recommendations for the following:

1. Data storage and analysis. I've used Excel forever, but does the professional science world use it?
2. Graphing. I've only used Excel.
3. Composition. Is LaTeX the de facto standard in reports? I've been using LaTeX for a year, but it seems like I spend more time formatting the document than writing it. I like LaTeX for citations (BibDesk!) and because I can use version control (Mercurial), but inserting figures and data tables confuses the hell out of me.

Requirements: Leopard-compatible, good documentation, and free.

Sorry for the vague question - I have no idea what I'm researching yet. That's for April :).
posted by fleeba to Computers & Internet (28 answers total)

I don't know if biologists use it, but Matlab is easy to learn and makes nice graphs. It can also get data from and give data to Excel, which is handy. I think the student license is cheap and you may be able to get it for free through your school.

You can do heavy number crunching for free in R, although I think there's more of a learning curve there. I've only played with it a little.
posted by xbalto at 11:38 PM on March 23, 2008

Data storage and analysis. I've used Excel forever, but does the professional science world use it?

Biologists use a mix: text files, databases, spreadsheets. Analysis seems to be done with Aabel, Excel, Sigma Plot, R, etc... i.e., generally whatever the researcher is familiar with, and whatever applies to the task at hand.

Graphing: R, Matlab, Excel, Sigma Plot. R has very flexible graphing capabilities but requires an investment of time to learn its language. It's free, while its competitors are not. IMO, this is a good starting book on R, and this is a great reference.

Composition. The only biologists I've seen who use LaTeX are computational biologists. Most others avoid it, mainly because of the heavy lifting required to handle images, when pretty tables, figures and graphs are what my PI says "sell" papers. EndNote and Word seem to be the de facto standards.
posted by Blazecock Pileon at 11:46 PM on March 23, 2008

As a biologist, I can tell you that the sort of journals I try to publish in often don't even accept LaTeX anymore. I'm ashamed to say, we all use Word and Endnote. So, within the people I work with and have worked with, LaTeX is virtually unknown, and everyone has MS Office.

Data storage - Excel is actually a very useful tool for converting between formats and doing quick mucking about with data. But for "serious" data storage, I generally use Access, because a lot of the other software I use (R, Python, ArcGIS) talks to it pretty easily, and it's really not a bad piece of software, either. Ultimately, a lot of the data I work with ends up as CSV files or Access tables.

Seconding R - it's the future of statistics, as far as I can tell. And it can be good for graphing, although there is a bit of a steep learning curve for that. Take time off learning LaTeX, and learn R instead.
posted by Jimbob at 11:47 PM on March 23, 2008

I'm a physics grad student whose unfortunate duties include selecting and compiling the software that is put on all the laptops given to undergrad students.

Data storage? You will be amazed at how much you will come to love flat files that separate each field with a comma, since almost all programs accept them. Usually what happens is that your measuring device outputs some weird format, such as the first two fields separated by &, the rest by tabs, except that the last field is separated by a |, and it uses commas as decimal points. You use a combination of Excel and OpenOffice (REALLY good for conversions into different formats) to get it into the format you want so you can process the data.
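
If the spreadsheet route gets tedious, the same cleanup can be scripted. Here is a minimal Python sketch, assuming a made-up instrument format like the one described above (first two fields split by &, the rest by tabs, the last by |, commas as decimal points). The sample values and function names are hypothetical, not from any real device.

```python
import csv
import io
import re


def parse_instrument_line(line):
    """Split one raw line on '&', tab, or '|' and turn decimal commas into floats."""
    fields = re.split(r"[&\t|]", line.strip())
    return [float(field.replace(",", ".")) for field in fields]


def convert_to_csv(raw_text):
    """Return the raw instrument text rewritten as plain comma-separated values."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in raw_text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(parse_instrument_line(line))
    return out.getvalue()


# Hypothetical sample: two readings in the mixed-delimiter format.
raw = "12,5&0,031\t7,25\t1,0|99,9\n13,0&0,029\t7,5\t1,0|98,7\n"
csv_text = convert_to_csv(raw)
print(csv_text)
```

A real instrument file will probably also have header lines or units that need their own handling, so treat this as a starting point only.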

Analysis? Mathematica (not free), MatLab (not free), fityk (free) for peak analysis (not my favorite thing, but people swear by it), Excel/OpenOffice for quick linear fits.

2. Mathematica will give you some of the best 2D plots you will ever see; MatLab is the one for 3D plots (neither is free). On the free side, there's gnuplot.

3. I don't know of a single journal in my field that accepts anything but LaTeX. You really should have a template by now that has almost all the formatting done for you; however, you really do need to pick up a book on LaTeX, period. Tables are easy; figures, however, are the bane of people's lives in LaTeX. If your issue is making figures stick where you want them (it is for most people), don't follow the proper figure definition (it allows the figure to float). What you want to do is just use \scalebox{1.00}{\includegraphics{graphic.jpg}}. This keeps it where you want, at the cost of not being able to reference it internally.
posted by metex at 12:09 AM on March 24, 2008

I don't know of a single journal in my field that accepts anything but LaTeX.

And here is the big argument, and probably the big distinction between, say, physics and biology. As I said, I haven't come across many journals (particularly smaller, specialist ones) in my fields that do accept LaTeX.
posted by Jimbob at 12:15 AM on March 24, 2008

do professional biologists use MS Office?

Oh yes, most definitely. Papers are written in Word, posters are made in PowerPoint, Office 2003 is our go-to application. References are stored and formatted with either EndNote or RefMan, where EndNote is more common. You don't need to change anything there.

Data analysis is done in whatever you have access to or is appropriate for your field. R and python and whatever are certainly useful if you do that level of data crunching but it really depends on what kind of information you're producing. The lab you go into should have this kind of thing up and running already and you'll learn it as you go along.

I've seen data storage in Access or custom-built Oracle databases, but I personally don't need to go beyond Excel spreadsheets (note: someone else does my bioinformatics, they use a database). You can do a lot with a good spreadsheet, particularly once you get the pivot tables working. I prefer Office 2008 for this. Learn to be an Excel power user.

The main thing we don't use Office for is graphing and this is where you'll need to do some learning. Excel charts aren't suitable for publication, stop relying on them as soon as you can. Graphing software seems to vary by organisation, what they have access to and what they're comfortable with. Personally I like Sigma Plot (although my current company uses S Plus) but there are several options and they're not too dissimilar. Find out what your lab will have available and put in some time learning that. Often it's standard across a university, i.e. there's at least one programme available to all students, and the graduate center or teaching unit or IT helpdesk or someone will offer classes in how to use it. Find them and take them because being able to make decent graphs will stand you in good stead forever after.
posted by shelleycat at 12:25 AM on March 24, 2008

I'm an engineer, so that might be different. But

1. I prefer CSV files, since they're really portable. They can be read by a bunch of different spreadsheet programs and can be easily manipulated with a Perl script or something. Excel saves in this format, so it'll totally work.

2. Excel also works for plotting. In the course of things, I've also used Matlab, Mathematica - which I thought were good for graphing whatever calculations were done in those programs. For flat out plotting, I like Excel - but I also like Kaleidagraph, since there's much less formatting that needs to be done to avoid that "Oh that guy did it in Excel" look. It's pricey, though, and for an undergrad thesis Excel will be fine.

3. No idea! Sorry.
posted by universal_qlc at 12:30 AM on March 24, 2008

You can do a lot with a good spreadsheet, particularly once you get the pivot tables working.

I would actually recommend not wasting time with Windows Excel-only pivot tables, and instead learning SQL in (free, multi-platform) MySQL. Pivot tables are really just a subset of a certain type of SQL SELECT query, called an "aggregation". But you can do so much more with SQL, and it is a skill you'll be able to use in pretty much any lab you go to.
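
To make the pivot-table comparison concrete, here is a minimal sketch of an aggregation query. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs without a server); the table, column names and numbers are invented for illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (site TEXT, species TEXT, n_individuals INTEGER)"
)
rows = [
    ("pond", "mayfly", 12),
    ("pond", "caddisfly", 5),
    ("stream", "mayfly", 30),
    ("stream", "caddisfly", 8),
    ("stream", "mayfly", 10),
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)

# GROUP BY plays the role of the pivot table's row fields;
# SUM and COUNT play the role of its summary values.
query = """
    SELECT site, species, SUM(n_individuals) AS total, COUNT(*) AS n_samples
    FROM measurements
    GROUP BY site, species
    ORDER BY site, species
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
conn.close()
```

The same SELECT ... GROUP BY statement should carry over to MySQL or PostgreSQL largely unchanged, which is the portability argument being made here.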
posted by Blazecock Pileon at 12:39 AM on March 24, 2008

instead learning SQL in (free, multi-platform) MySQL.

In my experience Access is a good intermediate option. It provides an easy interface to import data, muck about with tables, and export data, but lets you do SQL queries and that sort of stuff too. MySQL just doesn't have a nice interface to it; it's just a server, although there are some third-party interfaces available that can give an Access-like frontend to it. The database part of OpenOffice can be quite good for this.
posted by Jimbob at 1:24 AM on March 24, 2008

Nth-ing much of what is above. Biology is a Microsoft shop: papers are almost universally done in MSWord with Endnote, a lot of data is kept in Excel. Journals will accept MSWord, and maybe RTF and/or PDF. There are exceptions but they are just that - exceptions. Learning LaTeX just for a summer project would be foolish, and I'm unpersuaded that many biologists would benefit under any circumstances. (Admission: I did all my theses in LaTeX and am glad I did. But I had a hell of a lot of equations and special formatting, and it took a lot of sweat and tears.)

Visualisation is more varied. Looking for a decent analysis and plotting tool is a common topic of conversation in biology, and one that is affected heavily by what platform you are on. Excel plots look like shit, but a lot of people still use them. Otherwise JMP, Stata, Sigma Plot and others have a good reputation and do decent output. R would be a neat skill to acquire, but in a short project you have to make a judgement about investment of time. My general advice would be to see what the host lab is using, because then you will have access to the software and local expertise.

As for data storage, without knowing anything about the data you will be generating (obviously terabyte data generation has its own problems), Excel isn't a bad choice. Access is a decent second choice, and Filemaker has some mind share. The main point is to ensure that you can get data in and out of your chosen storage format with a minimum of pain. With Excel, Access (and CSV / TSV files), that's not a problem.
posted by outlier at 2:48 AM on March 24, 2008

Access doesn't work in Leopard, at least not without:

• an Intel Mac
• either Boot Camp, Fusion, or Parallels
• a license of Windows XP or Windows Vista
• a license of Office 2003 or 2007 Professional

Unless the asker has a sizable budget, I'd recommend staying away from Access.

Even then, Access has some significant usability restrictions. If a database is useful, better to stick to a non-Microsoft database engine that you can import to and export from, on which you can perform (and learn) standard SQL queries. There are many free options (MySQL, PostgreSQL, even OpenOffice's Base).
posted by Blazecock Pileon at 2:58 AM on March 24, 2008

I'm a biologist who uses Word and EndNote and a mixture of Excel and Access (I have to jump off of my Mac to use Access, but I'm learning how to use Base--a database is indispensable for large, environmental datasets). I use JMP for stats because it's free from my university (as is EndNote). Our lab has SigmaPlot for complicated graphs, but I find it a real pain to use and would love to find an alternative. Depending on the size of your literature review, you might also find Papers useful--I'm a compulsive PDF hoarder and I find it indispensable.

The most important thing is to find stuff that does what you need and does it easily--you don't want to waste your time fiddling with software.
posted by hydropsyche at 3:37 AM on March 24, 2008

Didn't realize that, Blazecock - I figured the MS Office suite was the MS Office suite, whether you're on Windows or Mac.
posted by Jimbob at 3:39 AM on March 24, 2008

I figured the MS Office suite was the MS Office suite, whether you're on Windows or Mac.

Unfortunately, no. In fact, Office 2008 gets rid of what little of Windows Office's Visual Basic compatibility was in Office 2004, and Excel 2004 does not have pivot tables. Excel 2008 has a stripped version of Excel 2007's pivot tables, which is why I suggest skipping it altogether, since you can do the same thing and get more CV mileage out of learning SQL.
posted by Blazecock Pileon at 3:52 AM on March 24, 2008

Excel, and spreadsheets more generally, are a hammer. They have their place but are overused. You should learn R if you need to do any statistics, and it also has excellent graphics capabilities, but JMP (which is basically SAS lite) is good for simple statistics. For scientific programming, Perl and Python are both good programming languages (I use Perl myself). Zotero for reference management. I wish I could recommend a good word processor, but basically they all suck.
posted by singingfish at 4:16 AM on March 24, 2008

Doing biostat consults, virtually 100% of what I see is MS Word. Just look up what journals you read and are likely to publish in.

Data storage and analysis are totally dependent on what kind of work you're doing. Most labs have a small enough dataset that they plunk it into Excel. Other labs that friends of mine work in have data that is only pictures! The team pulling serious cluster time at Argonne has unique methods; other than that, if you have millions of points, then flat files, SQL, or Access databases are a good idea. At that point it's worth an hour of your time to talk to a CS person who works with databases.

Analysis is similar, and depends on what you're doing. JMP is cool, and in learning it you learn some SAS, which is a heavy lifter. Stata is easy to learn, remarkably consistent, and does 99% of what I can think that I would want extremely well. The Stata manuals are also extremely nice and helpful. Programming native Stata is a total pain, but I'm told that their new language (Mata) is very nice and C-like. I used to be an R person, but I am now tired of how it never stops being a pain in the butt unless you use it continuously.
posted by a robot made out of meat at 4:41 AM on March 24, 2008

1 & 2. IDL is the de facto choice for data analysis and graphics in my field. If you are working with very large arrays of numbers, IDL is pretty much superior to anything else on the market. On the other hand, if you are not working with enormous arrays, IDL is probably overkill and not worth the expense. It does make fantastic graphs -- worlds better than almost anything else I've ever used, especially Mathematica.

3. You might want to check out LyX. It's a WYSIWYG editor for LaTeX files, and removes a level of complication from LaTeX coding. Depending on how complicated the task, it sometimes slows me down, sometimes saves me hours of time. But it's definitely helpful in many ways. It also integrates beautifully with BibDesk.
posted by dseaton at 5:13 AM on March 24, 2008

Pretty much nthing the above. I used to use LaTeX until it became painfully clear that nobody at all uses it any more. Everyone uses Word - no journals want LaTeX any more. For analysis, I mostly use Matlab and/or R, and then make things pretty in Illustrator.

I use CiteULike for references. Bibim is a simple Mac app that takes a Bibtex file and turns it into an Office 2007/8 format sources.xml.
posted by dmd at 5:49 AM on March 24, 2008

I'm in physics and have spent a lot of effort trying to find software solutions that suit my needs. A good way to take lab notes has been my main concern, and I ended up using MS OneNote for that.

Measurements and data analysis is done in Matlab. My measurement system includes routines that automatically write the important parameters of all instruments, my comments and any graphs produced in human-readable form to OneNote, with a link to a directory that contains all files related to that measurement.

I keep data in Matlab's *.mat file format. It's good if you analyze data in Matlab, as you can load all measurement variables to the workspace in one action. It's also binary, which is necessary since a single measurement may produce gigabytes of data. Were I to do it again, I might try Python. It's just that its capabilities are scattered over many different packages and there is not a lot available for instrument control and data taking.

I realize my approach is Windows specific. What may be useful for you to think about is to have a system to take notes that works well with your system for capturing data. The more notes you take of what you do and think, the happier you will be in the end. Try to eliminate all hurdles that could make you postpone or neglect it.
posted by springload at 8:35 AM on March 24, 2008

Neuroscience grad student here, and I have several friends on the cell bio side of the equation.

(note: just re-read your post and realized you need Leopard compat, so not all this applies, but still worth considering.)

Yes, to Office. If you're just putting together your system now, get Office 2007 (or 2008 if you're a Mac person). It will take some getting used to, but is wonderful once you actually start using it the way it was meant to be used (I have converted everyone I've met from 2003 to 2007 and they all thank me). At first you'll be looking for every little option, but after a while it separates your workflow much more efficiently.

And yes, for citation management most people use EndNote. It integrates with Office, has all the online looking-up whatjamahickies. Another alternative is a Firefox plugin called Zotero, which I have installed but never took the time to figure out.

One note of caution with EndNote, though. I also work for a tech website, so I see my fair share of good and bad apps. EndNote is definitely on the BAD side - buggy, poor error checking, proprietary file format for libraries, etc. I swear, only in academia is something like this even remotely acceptable (users are more likely to take the time to troubleshoot intelligently and minimize any damage from a major failure - in the business and home user worlds the company would go under from support related expenses). Make sure you export your libraries once/year as some standard format, like html or xml, for your archives. 15 years down the line, EndNote will likely be gone, but at least you'll have access to your old citation lists (you know, for reminiscing).

In the past I have tried to use PaperPort and good file naming and folder layout, but that all fell to pieces pretty quick.

Another thing to consider - I loooove having a tablet PC. Even if you grab a cheap one just for reading and taking notes, it is totally worth it. You can find them for <$1k, so the price is right in line with other laptops. I use mine constantly for taking notes in meetings, referencing conversations, mapping out ideas, and highlighting and annotating papers. loveitloveitloveit. If you go this route, get OneNote 2007 and make sure to get Vista. All the tablet underpinnings have been greatly updated and all the junk you hear about Vista being bad is 90%+ FUD. Do not go with the ModBook. Sorry, but forcing you to always use landscape orientation means you lose 80% of a tablet's usefulness on the research side (e.g., displaying full pages). Since you like Macs, think of this as an accessory device more than another whole system.
posted by neuroking at 8:41 AM on March 24, 2008

posters are made in PowerPoint


I've been both the person producing the poster, and the person at the other end of the process helping people print out their PowerPoint posters to a large format printer. Yes, about 80% of the people in the sciences use PowerPoint for posters. But I've never seen a poster made or printed to large format from PowerPoint that did not have serious flaws. Although PowerPoint can be tricked into doing large format, it's not great and PowerPoint's inability to handle PDF or EPS really hurts when you are printing something at 36" x 48".

If you are going to do poster presentations, learn InDesign, Illustrator or another desktop publishing system. First, just about every data visualization software outside of the world of Microsoft is going to spit out EPS or PDF. Secondly, most DTP software can handle large-format output in some form. And learn how to either create your graphics as vector EPS or PDF or create print-ready high-resolution PNG or TIFF rasters. Because nothing is more heartbreaking than seeing a graduate student representing his or her life's work with fuzzy error bars and blocky JPEG artifacts. You don't need to be a graphic designer, just learn a basic poster layout with a title, author line, and three or four columns of placed text and graphics.
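
A rough way to check whether a raster is "print-ready" is plain arithmetic: pixels equal inches times dots per inch. The Python sketch below assumes a 300 dpi target, which is only a common rule of thumb and not something stated in this thread; the function names and the 800-pixel example are hypothetical.

```python
def pixels_needed(width_in, height_in, dpi):
    """Pixel dimensions required to print at the given size and resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width, printed_width_in):
    """Resolution you actually get when an image is stretched to a printed width."""
    return pixel_width / printed_width_in


# A full 36" x 48" poster rendered as one raster at an assumed 300 dpi.
poster_pixels = pixels_needed(36, 48, 300)
print(poster_pixels)

# A hypothetical 800-pixel-wide screenshot stretched across a 10" column.
screenshot_dpi = effective_dpi(800, 10)
print(screenshot_dpi)
```

Numbers like these are why vector EPS or PDF output is attractive for plots: it has no pixel budget to run out of.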

And, actually knowing what "camera-ready" means will give you points with just about every editor on the planet.
posted by KirkJobSluder at 9:19 AM on March 24, 2008

In biology, the answer is basically that everybody uses MS Office, plus EndNote and Adobe Creative Suite, especially Photoshop and Illustrator. I can't tell you much about chemistry.

Details about what to expect are below, but my strong recommendation is to use whatever software the people you need to exchange data and documents with are using. Don't be the weird LaTeX guy who emails people files they need to install new software to read, or the guy who can't decipher Word documents with the "Track Changes" feature.

"Data storage and analysis" is too vague to be meaningful. In biology, data are as often as not a bunch of photographs. Most biologists I know use Excel for their numerical data. Depending on the type of data that will be generated in the lab, people will use either Excel or some horrible software that is custom-designed for a particular machine. Or images in TIFF or Photoshop format. Often strange machines in the lab spew data into Excel tables that are full of macros, so you pretty much have to use Excel to analyse them. People also tend to use Excel for lists and tables that would be more sensibly kept in database software or flat text files.

Despite what you may read above, we can and do publish charts that have been generated in Excel, although I prefer GraphPad Prism for graphing. It depends on the type of graph, but a blanket statement that "Excel charts aren't suitable for publication" ignores the fact that most charts published in biology journals are generated in Excel.

For writing, almost everyone uses Word, with EndNote for reference management. I would imagine that the majority of professional biologists have never even heard of LaTeX. Given that you will be wanting to send your documents to your supervisor or colleagues, you should stick to the software they use. In biology, that will almost definitely be Word. Note that when preparing documents for publication, we prepare figures and tables as separate documents from the main manuscript. That was also the case for my PhD thesis. 'Inserting' them into the text is a job for page layout software such as InDesign, or possibly Apple's Pages.app, but this is something that you would do after all the writing.

Figures are typically put together using Photoshop, Illustrator or some combination of the two. Many scientists use PowerPoint for this, but that kind of thing shouldn't be encouraged.

For presentations, PowerPoint is king, but Keynote and PDF files work fine (and look better) if you're presenting from your own laptop.
posted by nowonmai at 7:19 PM on March 24, 2008

It really depends on what you are doing; I would recommend waiting to see what your project is and what your supervisors recommend.

For software I have experience with, Se-Al is probably the best DNA sequence alignment program I have used (but it is Mac only; use BioEdit on Windows). The European Bioinformatics Institute has pretty much every other tool you will need if you are working with sequences, though if you are really keen you might want to learn BioPerl.

In the places I have worked at, the software used generally relies on what output formats the equipment you use supports.

For manuscript preparation, MS Office is pretty much universal, and your supervisor may object to receiving LaTeX documents. The statistical software, again, is often institution-specific; I have been forced to use R, Mathematica, MatLab and SPSS. The best advice I can give you is to pick up a simple programming language so that when you need to rename 500 files with an extra ' to be readable by some horrible proprietary program, you can do it quickly.
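
As an illustration of that kind of chore, here is a minimal Python sketch that appends an extra character to every matching file name. The .dat extension, the sample_NNN naming and the function name are assumptions for the demo, which runs in a throwaway temporary directory rather than on real data.

```python
import tempfile
from pathlib import Path


def add_to_stem(directory, pattern, extra):
    """Rename every file matching pattern so its name (before the extension) ends with extra."""
    renamed = []
    # sorted() collects the matches first, so renamed files are not picked up again.
    for path in sorted(Path(directory).glob(pattern)):
        target = path.with_name(path.stem + extra + path.suffix)
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Demo on three dummy files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"sample_{i:03d}.dat").write_text("demo\n")
    new_names = add_to_stem(tmp, "*.dat", "'")
    print(new_names)
```

Trying a script like this on a copy of the folder first is a cheap safeguard, since a rename is awkward to undo by hand across 500 files.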
posted by scodger at 9:42 PM on March 24, 2008

I am a chemistry graduate student in a group that does physical chemistry, physics, biophysics, biochemistry, etc etc....so we submit to a wide range of journals, and all of them use Word/PDF as the preferred manuscript submission format. I find LaTeX has been relegated to hard-core physics/math journals these days.

1. I use OriginPro for all of my data storage, which is Windows only as far as I know, so I have been running it on a Boot Camp partition since I made the switch. I have used Excel and Kaleidagraph, and since my lab has not believed in purchasing software since 1999, they are older versions and somewhat crappy (especially Kaleidagraph, good for bar graphs though!). I like Origin a lot but would not be averse to finding something less crashy and Mac-compatible. Origin is very handy for storing lots of sets of data in one file and is also compatible with Excel imports. There are a few open-source Mac programs I have tried, but I kind of just gave up and stuck with Origin. A lot of people I know use MatLab for plotting. I use Illustrator to prettify all of my plots for publications/posters, and I also make posters in Illustrator.

2. Origin for most stuff, occasionally Kaleidagraph.

3. As I said above, Word. I have been using Papers since the fall for PDF storage and literature searching, and if your experience is anything like the three times I have started research projects, you will be reading a TON of literature. I love, love, love Papers. It currently does not have the EndNote capabilities of making a bibliography, but you can export a "folder" of references as an EndNote library and do it that way. There are a few workarounds and the user base is getting big enough now that people are very good at posting these things on the Papers help boards. It is $25 for students, but totally worth it. The programmers are former graduate students (now post-docs) so they are very responsive to feature requests. I use SciFinder occasionally for literature searches (especially very broad, deep searches) but for general "oh I need another paper by so-and-so" the Papers search functions are super handy and much more accessible than SciFinder. I don't know if you even have access to SciFinder depending on the size of your institution.
posted by sararah at 8:56 AM on March 25, 2008

As far as data storage goes, use whatever you need to manage your data and do your analysis. If you can work in one package, that's great, because it should lessen the number of copies of the data you have lying around (imagine trying to correct the same error in a bunch of files).

However, at the end of the experiment or project you should make sure the data is stored in a way that is accessible for the future. I used to work in a lab that had switched over to PCs from Macs and was given old Mac floppies containing CricketGraph data to bring into Excel(!).
It could be as simple as .csv files or as complicated as a database. I would imagine the major formats such as Excel will probably *never* go away and they may be safe to use.

One thing, make sure you keep good metadata for your data so it can be understood in the future. I tend to keep a sheet in my Excel workbook, reserved for just a short description of the project and field definitions.
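
For data kept as plain .csv files, the same idea can live in a small sidecar file next to the data. This is only a sketch in Python: the project name, field names and descriptions are invented, and JSON is just one reasonable choice of format.

```python
import json

# Hypothetical project description and field definitions.
metadata = {
    "project": "Hypothetical stream survey",
    "description": "Weekly invertebrate counts at two sites.",
    "fields": {
        "site": "Sampling location name",
        "temp_c": "Water temperature in degrees Celsius",
        "n_individuals": "Number of individuals counted per sample",
    },
}


def undocumented_columns(header, metadata):
    """Return the column names that have no definition in the metadata."""
    return [name for name in header if name not in metadata["fields"]]


# Header row of the data file; 'ph' was added later and never documented.
header = ["site", "temp_c", "n_individuals", "ph"]
missing = undocumented_columns(header, metadata)
print(missing)

# Text that could be saved alongside the data, e.g. as survey_metadata.json.
metadata_text = json.dumps(metadata, indent=2, sort_keys=True)
```

Running a check like this whenever a column is added is one way to keep the definitions from drifting out of date.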
posted by buttercup at 7:11 PM on March 25, 2008

As long as the printer is set up right, I've never had an issue with PowerPoint printing on posters. Our Dept has a large (56" wide?) printer and just spits them out in about 15 minutes. Sometimes the color calibration is a little different than what you get on the screen, but not a big deal. The issues come from people embedding an image that was 1000000x1000000 pixels five times and shrunk down to 1x1 inch.

It takes about 5 seconds to export the PowerPoint to PDF (free plug in at MS's site), which works just as well, but still, putting together a poster in PowerPoint is much easier for science types that are used to MS Office. Remember that everyone in the lab has Office, there's lots of cutting and pasting between Office apps, which retains formatting, and really, you make a poster maybe twice/year. Hardly worth learning a new program, let alone paying for x number of licenses of one. If you want to use InDesign, more power to ya, but it isn't really a legit gripe in this case (art and design students yes, bio geeks no).
posted by neuroking at 7:25 PM on March 25, 2008

Response by poster: Wow, thanks for all the responses! I'm pretty excited about the data possibilities (sad, I know!). I'm comfortable with Python and SQL from web development, so storage and plotting look great. I found an awesome program called Plot that accepts data from MySQL, so I'll try it on one of my labs this week. I definitely want to check out matplotlib, R, and gnuplot too, so I could have an automated system.

That does suck about MS Word though. I wish diff would return meaningful output on binary files :(. Thanks for the info on posters too. I actually know InDesign and Illustrator better than PowerPoint, mainly because MS Office '04 is too slow on Intel :). I'll pick up a copy of '08 this week - four years of development == faster, right? I hope so!
posted by fleeba at 11:09 PM on March 25, 2008

I've occasionally gotten a no-track-changes revision in Word; the solution is to use kdiff3 (or cdif, I'm told). Kdiff3 will show you (color-code) the specific changes and word-wrap the result, since the lines are typically too long to diff meaningfully.
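
If neither tool is handy, a word-level comparison can be roughed out with Python's standard difflib module. This is not how kdiff3 works internally; it is just a sketch that assumes both revisions have already been saved as plain text, and the two sample sentences are made up.

```python
import difflib


def word_changes(old_text, new_text):
    """List (operation, old words, new words) for every stretch that differs."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


old = "Excel charts are not suitable for publication in most journals."
new = "Excel charts are rarely suitable for publication in many journals."
changes = word_changes(old, new)
for change in changes:
    print(change)
```

Comparing word by word sidesteps the long-line problem, because a paragraph that is one giant line still breaks down into small units.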
posted by a robot made out of meat at 4:38 AM on March 26, 2008
