Best practices for scientific Python prototyping/debugging?
January 15, 2013 1:50 PM

I'm using Python in my research now to build some models and analyze some data. I've done this with Matlab and R in the past and became fairly used to running individual lines or blocks of code at a time and doing a lot of interactive printing and plotting from the console. This behavior seems to be a bit more difficult in Python and I'm trying to find the best way to be productive.

The behavior in Matlab is called 'Cell Mode' and it is common in a lot of scientific software. It lets you demarcate code regions with a special comment character, and with Ctrl+Enter you run the current cell where your cursor is located. Cells are kind of like methods but you don't call them -- you move the cursor to run individual cells, or you run all the cells in a file sequentially. Variables within a cell persist after a cell is run.
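For anyone who hasn't seen cell mode, here is a minimal sketch of what such a file might look like in Python. It assumes an editor that treats `# %%` comments as cell boundaries (Spyder is one that does; other editors use different markers), and the data and variable names are made up for illustration. To plain Python the markers are ordinary comments, so the file still runs top to bottom as a normal script.

```python
# %% Cell 1: load (here: hard-code) some data
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# %% Cell 2: compute summary statistics
# Variables from cell 1 are still alive in the session when this cell runs.
n = len(samples)
mean = sum(samples) / n
variance = sum((x - mean) ** 2 for x in samples) / n

# %% Cell 3: inspect the results; re-run only this cell while tweaking output
print("mean =", mean, "variance =", variance)
```

The point is that cell 3 can be re-run on its own after editing it, without reloading the data in cell 1.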

For complex and large tasks it makes sense to write more 'proper' code, separating data processing and analysis steps into separate methods or files, but most of the time this is overkill, and it makes it harder to do interactive data exploration and plotting.

I currently use an IDE called Spyder which is recommended by other scientific Python users. It has a feature called 'Run selection or current block' which is supposed to replicate this, but it is buggy and breaks if you use loops. I've been looking for another IDE with this feature and have come up empty, and I don't have time to try a bunch. Any suggestions?

Alternatively I have been trying to get into a more traditional programming workflow, separating my code into functions and using breakpoints and debuggers when I need to do interactive plotting and printing. I get frustrated by scoping -- variables within functions are local, of course, but I often want to be able to print and plot them as I'm writing the code so I can tweak things, and the only way I can get access to them is to stop execution within that method, but I find breakpoints and debuggers clunky. Maybe I just need to get better at this. I often can't get things to work the way I want, I use a lot of global variables, I find I want to interactively access local variables from other methods that I thought I wouldn't need to, and in general I am wasting time and headspace on this problem when I want to be focusing on my data. What are the best practices here?
posted by PercussivePaul to Science & Nature (11 answers total) 14 users marked this as a favorite
John Cook talks a little about using emacs with python, along the lines of what you describe.
posted by Blazecock Pileon at 2:10 PM on January 15, 2013

Cell mode in Python: StackOverflow advises.

You might also like: Cache debugging.

(Caveat: haven't tried any of these myself.)

My own preference is the one you outline in your last paragraph, probably because I'm a programmer who went into science rather than a scientist who started programming. Global variables give me a bad feeling for anything bigger than a throwaway script. Generally what I try to do is write a Python module which wraps up any code I'm going to use repeatedly, then import it in my processing scripts. I try to make my scripts look like the simplest possible pseudocode for the problem I'm attacking, with all the ugly details hidden in the module where they won't distract me. And of course I can import and use that same module in ipython when I'm exploring stuff interactively.
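As a rough sketch of that module-plus-script split (everything here is hypothetical: the file name `analysis_tools.py`, the function names, and the data), the two parts are shown in one block so it runs as-is. In practice the first half would live in its own file and the script would import it.

```python
# --- would live in a module, e.g. analysis_tools.py (hypothetical name) ---
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def normalize(values):
    """Scale values linearly so they span the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# --- the processing script: short enough to read like pseudocode ---
raw = [1.0, 3.0, 5.0, 7.0, 9.0]
smoothed = moving_average(raw, window=3)
scaled = normalize(smoothed)
print(scaled)
```

The same functions can then be imported at an interactive prompt, so exploration and scripts share one tested copy of the details.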
posted by pont at 2:29 PM on January 15, 2013

As a Matlab user, I never found a use for cell mode and strongly prefer the "hover-over" variable state popups in keyboard/debug mode. What about cell mode are you trying to replicate that your favorite IDE doesn't offer? When I started playing with C in Eclipse, I found the variable state window and step-through mode to be a very serviceable replacement, although the stuff I write in Matlab is a lot more complicated than what I cobble together in C. Are there some unique features of cell mode that your workflow relies on?
posted by Nomyte at 2:45 PM on January 15, 2013

Check out the ipython notebook.
posted by qxntpqbbbqxl at 2:50 PM on January 15, 2013 [3 favorites]

This behavior seems to be a bit more difficult in Python

I'm only offering this observation as a data point, but I write Python from within emacs and was never aware that this was even a problem. Have you considered joining our church?
posted by lambdaphage at 2:51 PM on January 15, 2013 [2 favorites]

I use python in the way you describe. There are many ways to get there so I'll just describe my standard setup. Other ways may be better or easier out-of-the-box. I'm often logged in to remote supercomputing sites where I only have shell access, and that has dictated the requirements of my setup to some extent. For example, I've shied away from IDEs and things that are difficult to install on oddball operating systems, because I'm often dealing with oddball operating systems.

The bare minimum is Numpy + IPython + Matplotlib. Numpy gives you fast array operations, IPython gives you a comfortable interactive shell, and Matplotlib gives you nice plotting routines. They're more or less copied from Matlab so things should feel familiar.

I run IPython inside emacs (as mentioned above) to get behavior like the cell mode you mentioned. There are commands to do this; you just have to bind them to key combinations you like. Running IPython inside emacs also means that when an exception is thrown, emacs jumps to the point in your source that threw it.

Finally, Scipy provides a lot of standard algorithms like FFTs, ODE integrators, linear algebra, and such. It used to be a bear to install, but that was ~8 years ago and I haven't had a problem in a long time.

If you do parallel computing, IPython has very interesting facilities to, e.g., start a Python instance running on, say, 100 remote CPUs and then communicate with them. It's like you have 100 CPUs sitting under your desk. Interactive supercomputing.
posted by ngc4486 at 2:58 PM on January 15, 2013 [1 favorite]

If you do parallel computing, IPython has very interesting facilities to, e.g., start a Python instance running on, say, 100 remote CPUs and then communicate with them. It's like you have 100 CPUs sitting under your desk. Interactive supercomputing.

Could you say a little more about this?
posted by lambdaphage at 3:28 PM on January 15, 2013

Could you say a little more about this?

It's been a long time since I've used the parallel facilities of IPython heavily, so I'll have to direct you to the documentation, specifically the examples, for an up-to-date overview of its capabilities.

I can tell you about how I used it, though. I was processing snapshots from a simulation where I computed some quantity for each snapshot and then made a plot of the quantity as it changed through the simulation. Each snapshot didn't take very long (maybe ten seconds) but there were a lot of snapshots so it took an hour or two to run through all of them on a single processor.

With very little trouble I was able to set things up so that when I started IPython, it started IPython instances on each node of a ~100 node cluster, loaded a bunch of Python modules, and then awaited my commands. Then at the IPython prompt I could pretty concisely say things like: "Send one snapshot ID to each node and run this function on the corresponding snapshot." The results came back as a Python list and I could draw a plot or whatever. And the results came back in seconds rather than an hour. Let me tell you, there's an enormous psychological change that happens when your turnaround time goes from hours to something interactive, like seconds.
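To give a feel for the "send one snapshot ID to each worker and collect a list" pattern, here is a minimal sketch. It is not IPython's parallel API: it swaps in the standard library's `concurrent.futures` thread pool as a local stand-in for the cluster, and `analyze_snapshot` and the snapshot IDs are hypothetical. IPython's parallel facilities offer a similar map-style call over remote engines; check the documentation for the exact interface.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_snapshot(snapshot_id):
    """Hypothetical per-snapshot analysis; stands in for the real work."""
    return snapshot_id ** 2


snapshot_ids = list(range(8))

# Local stand-in for the cluster: a small pool of worker threads.
# map() hands out one ID per task and returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze_snapshot, snapshot_ids))

print(results)
```

Because the results come back as an ordinary list in the same order as the IDs, they can go straight into a plot.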

It's true that you can duplicate this particular use case with shell scripts, ssh, and pipes. But it's hard to do it in a way that allows the same flexibility. It's also a pretty awesome feeling to know that your python prompt has a live connection to 100 or 1000 processors just waiting to do your bidding.

However, using a cluster this way basically depended on the department having just bought a newer, bigger cluster, and thus no one much cared what I was doing with the old one. This gets to a long-standing gripe of mine about how supercomputers are usually set up. When someone spends a million bucks on a computer, they want to be sure that all of the processors are in use all the time. So they put a batch queuing system in front of the processing nodes. Then if someone thinks "Gee, I'd like my data analysis to run faster, maybe I should parallelize it," the response is "Well, first you have to learn MPI, then you compile your program differently, then you learn how to use PBS, then you write a batch script, then you submit your script to the queuing system, then you wait for it to run, and then you get your data!" So the natural response is "Umm, well, that sounds complicated, maybe I'll just get coffee while my program runs on one processor."

If, on the other hand, you make it easy for people to take their existing mostly serial code and say "Just run this part in parallel and farm out one image to each processor," then you can just as well fill up a big machine with these sorts of short tasks. And everyone feels as though they've got 1000 processors sitting under their desk, because when I say "go" and use the whole machine for a ten second slice of time, you'll be looking at the plot you just drew. Then when you say "go" and use the whole machine for ten seconds, I'll be looking at the plot I just drew. I think this would be a net win because of the enormous psychological difference between "this takes an hour to run" and "this takes a few seconds to run."
posted by ngc4486 at 5:14 PM on January 15, 2013 [1 favorite]

Via pont's StackOverflow answer (somehow I did not find that when I searched) I find that IEP is a much better IDE for my purposes and has a great implementation of cell mode. It's also snappier than Spyderlib.

Letting go of IDEs and gaining some proficiency in emacs is something I might do when I have a bit of time in the future as I can plainly see there are some higher-productivity modes to be reached. Thanks!
posted by PercussivePaul at 6:40 PM on January 15, 2013

I often can't get things to work the way I want, I use a lot of global variables, I find I want to interactively access local variables from other methods that I thought I wouldn't need to, and in general I am wasting time and headspace on this problem when I want to be focusing on my data. What are the best practices here?

I know what you mean. Without knowing ahead of time what you're even doing, it's quite difficult to impose any sort of meaningful structure on the code, and it's easy to fall into the flexibility trap of making everything global just in case you might need to reference it in some other scope.

I've picked up a few tricks for how to handle this in a somewhat sane fashion, but the best practice I know of is to periodically rewrite your code in a more formal fashion as your design requirements converge. It takes time, and it's probably not very interesting compared to your research, but the payoff when you need to verify that your code works or you want to get colleagues involved in your work is worth it.
posted by RonButNotStupid at 7:10 PM on January 15, 2013

Some of the tricks I've picked up. I tend to start off with these to get something working, then as I get a better idea of the scope of the project, I try to replace them with something a little more professional.

1) Take advantage of the mutability of objects. Create an empty dictionary in the global namespace, then pass that object to a method and have all the variables declared within that method be added to the dictionary. Because dictionaries are mutable, when the method has finished executing, you can interactively examine the dictionary and see all the variables you placed there in the method.

2) Instead of passing individual variables to a method, group them together in a custom class and then pass objects from that class to the method. Python's built-in mechanisms for creating getters and setters make a great hook for automatically outputting the value of a variable to the console whenever it's accessed or modified.

3) Use nested functions. If a function requires a ton of parameters, sometimes it's easier to just 'create' the function within the namespace that has those parameters instead of worrying about passing them as arguments every time the function is called, especially if you intend to call the function in a namespace that doesn't have those parameters.
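A minimal sketch of the three tricks above, in order. All names (`fit_line`, `Params`, `make_model`, the `debug` argument) are made up for illustration, and this is only one way to do each of them.

```python
# Trick 1: pass in a mutable dict and stash intermediate values in it.
def fit_line(xs, ys, debug=None):
    """Least-squares slope and intercept; intermediates go into `debug`."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if debug is not None:
        debug.update(mean_x=mean_x, mean_y=mean_y, sxy=sxy, sxx=sxx)
    return slope, intercept


scratch = {}  # lives in the global namespace, inspectable afterwards
slope, intercept = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0], debug=scratch)


# Trick 2: group parameters in a class; a property reports every access.
class Params(object):
    def __init__(self, gain):
        self.log = []     # record of accesses, kept alongside the printing
        self.gain = gain  # goes through the setter below

    @property
    def gain(self):
        self.log.append(("get", self._gain))
        print("gain read:", self._gain)
        return self._gain

    @gain.setter
    def gain(self, value):
        self.log.append(("set", value))
        print("gain set to:", value)
        self._gain = value


def amplify(signal, params):
    return [params.gain * s for s in signal]


params = Params(2.0)
amplified = amplify([1.0, 2.0, 3.0], params)


# Trick 3: a nested function captures parameters from the enclosing scope,
# so callers elsewhere don't have to pass them every time.
def make_model(slope, intercept):
    def predict(x):
        return slope * x + intercept
    return predict


predict = make_model(slope, intercept)
print(predict(10.0))
```

After this runs, `scratch` and `params.log` can be poked at from the interactive prompt, which is the whole point of the first two tricks.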
posted by RonButNotStupid at 7:28 PM on January 15, 2013 [2 favorites]
