Why do my Python scripts not finish?
June 11, 2019 12:03 PM

For a hobby project, I run hourly Python scripts via a cron job. These scripts always run correctly (i.e., they produce the expected output), but sometimes one of them doesn't terminate: the Python process lives on, hogging memory until I kill it manually. Can you help me troubleshoot this?

I'm running Python 3 (Anaconda build) on a remote Ubuntu server. The script uses Pandas and Matplotlib to read in CSV data, do some processing, and draw some plots.

If I run a `ps` command after a couple of days, I see that some of the processes haven't completed:

$ ps -eo pmem,pcpu,etime,vsize,pid,cmd | sort -k 1 -nr | head -5
4.4 0.0 8-01:37:37 801820 15556 /home/msj/miniconda3/bin/python3 /home/msj/mobi/mobi.py --plots /data/mobi/data/
4.4 0.0 4-03:37:37 793808 8619 /home/msj/miniconda3/bin/python3 /home/msj/mobi/mobi.py --plots /data/mobi/data/
4.2 0.0 6-00:37:37 793432 15083 /home/msj/miniconda3/bin/python3 /home/msj/mobi/mobi.py --plots /data/mobi/data/
4.2 0.0 5-02:37:37 793252 26810 /home/msj/miniconda3/bin/python3 /home/msj/mobi/mobi.py --plots /data/mobi/data/
4.0 0.0 7-23:37:37 789124 19969 /home/msj/miniconda3/bin/python3 /home/msj/mobi/mobi.py --plots /data/mobi/data/

I can't recreate this problem outside of cron: when I run the script by hand in the terminal, it completes fine. Even under cron it completes *most* of the time, failing to finish only about 10% of the time.

I use Python a lot (though I'm by no means an expert), and this is the only time I've run into this issue. I have a bunch of other scripts that run via cron in the same environment, and only this one has the problem.

I know there isn't enough information here to really debug this, but if you've seen something like it before, or have any suggestions of things to try, I'd really appreciate the advice.
posted by no regrets, coyote to Computers & Internet (4 answers total)

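
The `etime` column in the `ps` listing above uses the format `[[dd-]hh:]mm:ss`, so `8-01:37:37` is a run that started 8 days, 1 hour, 37 minutes and 37 seconds earlier. For an hourly job, anything older than an hour has already overlapped the next run. A minimal sketch for flagging such runs from the same `ps` columns, assuming a Linux `ps` from procps (for `--no-headers`); the function names and the one-hour threshold are hypothetical:

```python
import subprocess
from typing import Iterator, Tuple

# Hypothetical threshold: an hourly job should finish within an hour.
MAX_AGE_SECONDS = 60 * 60


def etime_to_seconds(etime: str) -> int:
    """Convert ps's [[dd-]hh:]mm:ss elapsed time to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Left-pad so there are always hours, minutes and seconds.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def stale_processes(ps_output: str, pattern: str,
                    max_age: int = MAX_AGE_SECONDS) -> Iterator[Tuple[int, int]]:
    """Yield (pid, age_in_seconds) for commands containing `pattern` older than max_age.

    Expects lines in the column order pmem,pcpu,etime,vsize,pid,cmd.
    """
    for line in ps_output.splitlines():
        fields = line.split(None, 5)  # cmd may contain spaces, so split at most 5 times
        if len(fields) < 6 or pattern not in fields[5]:
            continue
        age = etime_to_seconds(fields[2])
        if age > max_age:
            yield int(fields[4]), age


def report_stale(pattern: str = "mobi.py") -> None:
    """Print every matching process that has outlived MAX_AGE_SECONDS."""
    out = subprocess.run(
        ["ps", "-eo", "pmem,pcpu,etime,vsize,pid,cmd", "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, age in stale_processes(out, pattern):
        print(f"PID {pid} has been running for {age / 3600:.1f} hours")
```

Calling `report_stale()` by hand, or from a separate cron entry, would list the PIDs worth attaching a debugger to.
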
The first thing to try is to attach to one of the stuck processes in a debugger and see if there are any clues to be had from the backtrace. You'll need root privileges on your remote server to do this. Here's an example where the Python process is just waiting for keyboard input:
$ sudo apt install gdb # if necessary
$ sudo gdb
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
(gdb) attach 29156
Attaching to process 29156
0x00007fd6abc48573 in __select_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
84	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) backtrace
#0  0x00007fd6abc48573 in __select_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fd6aa65ccc7 in ?? ()
   from /usr/lib/python2.7/lib-dynload/readline.x86_64-linux-gnu.so
#2  0x000000000044a9a7 in PyOS_Readline ()
#3  0x00000000004255c9 in ?? ()
#4  0x00000000004b4235 in PyParser_ASTFromFile ()
#5  0x000000000044a756 in PyRun_InteractiveOneFlags ()
#6  0x000000000044a58d in PyRun_InteractiveLoopFlags ()
#7  0x0000000000430b76 in ?? ()
#8  0x00000000004938ce in Py_Main ()
#9  0x00007fd6abb6b830 in __libc_start_main (main=0x493370 <main>, argc=1, argv=0x7ffca0174c78, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffca0174c68)
    at ../csu/libc-start.c:291
#10 0x0000000000493299 in _start ()

posted by cyanistes at 12:31 PM on June 11
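
When there are several stuck PIDs, the same attach-and-backtrace step can be scripted with gdb's batch mode instead of an interactive session. A minimal sketch, assuming gdb is installed and the script runs with enough privilege to attach (e.g. under `sudo`); the helper names are hypothetical. Passing `py-bt` instead of `bt` gives a Python-level trace, but only if gdb has loaded the Python extensions for that particular interpreter, which may not happen automatically for an Anaconda build:

```python
import subprocess
from typing import List


def gdb_batch_command(pid: int, gdb_command: str = "bt") -> List[str]:
    """Build a gdb command line that attaches to `pid`, runs one command, and exits."""
    return [
        "gdb",
        "-p", str(pid),       # attach to the running process
        "-batch",             # run the -ex command(s), then exit and detach
        "-ex", gdb_command,   # "bt" for a C-level trace, "py-bt" for a Python-level one
    ]


def backtrace(pid: int, gdb_command: str = "bt", timeout: float = 60.0) -> str:
    """Run gdb against `pid` and return everything it printed."""
    result = subprocess.run(
        gdb_batch_command(pid, gdb_command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr
```

For example, `print(backtrace(15556))`, run as root with a PID taken from the `ps` listing.
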

Do the scripts still fail to complete if you remove all the plotting code? You may need to throw in some `plt.close()` calls (or `plt.close(fig)` if there is more than one figure). I've seen Matplotlib launch extra Python processes that stick around.
posted by caek at 2:08 PM on June 11
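
A minimal sketch of the per-figure version of that suggestion, assuming a headless server (so it selects Matplotlib's non-interactive Agg backend before importing pyplot). The `plot_and_save` function and its arguments are hypothetical stand-ins for whatever the real script plots:

```python
from pathlib import Path
from typing import Sequence

import matplotlib

matplotlib.use("Agg")  # assumption: headless server, so pick a non-interactive backend
import matplotlib.pyplot as plt


def plot_and_save(values: Sequence[float], title: str, out_path: Path) -> None:
    """Draw one line plot, save it, and always close the figure afterwards."""
    fig, ax = plt.subplots()
    try:
        ax.plot(values)
        ax.set_title(title)
        fig.savefig(out_path)
    finally:
        # Close this specific figure even if plotting or saving raised,
        # so pyplot does not keep holding a reference to it.
        plt.close(fig)
```

A pandas Series can be passed as `values` directly. `plt.close('all')` at the end of the script is the blunter equivalent: it closes every figure pyplot is still tracking.
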

> The first thing to try is to attach to one of the stuck processes in a debugger and see if there are any clues to be had from the backtrace.

If you go this route, you'll probably want to install the Python GDB extensions. In particular, they add a `py-bt` command that acts like `backtrace`, but extracts a Python-level stack trace from the interpreter's data structures instead of just showing you which part of the interpreter each stack frame is in. This is likely to be a lot more useful for tracking down the problem.

A simpler way to get a stack trace is to kill the process with SIGINT (`kill -INT <pid>`), which is equivalent to pressing Ctrl-C in a terminal. This prints a stack trace on stderr, but it only works as long as nothing else catches the signal or the resulting KeyboardInterrupt exception, and as long as the process isn't stuck in an uninterruptible system call.
posted by teraflop at 3:13 PM on June 11
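
A related option from the standard library is the `faulthandler` module. Unlike SIGINT, its signal handler writes the Python traceback of every thread directly from C, so it does not depend on a KeyboardInterrupt propagating, and it can also act as a watchdog that dumps tracebacks and exits if a run takes too long. A minimal sketch, assuming a POSIX system, SIGUSR1 as the signal, and a hypothetical 50-minute limit for the hourly job; `install_debug_hooks` and `main` are placeholder names:

```python
import faulthandler
import signal
import sys

# Hypothetical limit: an hourly cron job should finish before the next run starts.
WATCHDOG_SECONDS = 50 * 60


def install_debug_hooks(log_file=sys.stderr) -> None:
    """Make a hung run diagnosable, and stop it from outliving its slot.

    `log_file` must stay open for the life of the process, because
    faulthandler writes to its file descriptor directly.
    """
    # `kill -USR1 <pid>` dumps the traceback of every thread and lets the
    # process keep running; no Python exception is raised.
    faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)

    # If the process is still alive after WATCHDOG_SECONDS, dump all
    # tracebacks and then exit the process.
    faulthandler.dump_traceback_later(WATCHDOG_SECONDS, exit=True, file=log_file)


def main() -> None:
    """Placeholder for the script's existing pandas/matplotlib work."""


if __name__ == "__main__":
    install_debug_hooks()
    try:
        main()
    finally:
        # The run finished normally, so disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()
```

Under cron, stderr usually goes to local mail or nowhere unless the crontab entry redirects it, so redirecting it to a log file makes these dumps easier to find.
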

I appreciate the gdb recommendations, since I hadn't heard of it, but adding `plt.close('all')` at the end of my script seems to have fixed the problem. Thanks!
posted by no regrets, coyote at 2:55 PM on June 12
