RPy problems
February 25, 2008 9:05 PM

Stumped with RPy, need help badly!

I'd like to use RPy to try to manipulate a Python 2.3.4 list within R 2.4.1. My list is made up of five arrays (four of string-type — "utility", "target", "build" and "timeType" — and one float-type — "time").

My problem is that I can't seem to build a data frame in R. I'd like to, for example, group my data analysis by 'utility', 'target' and 'build' from calls made within Python.

When I use r.data_frame() to create a data frame object, the resulting object is not an R data frame. The following prints "False" on the call from r.is_data_frame():

==========
timeDataFrame = { "utility":[],
                  "target":[],
                  "build":[],
                  "timeType":[],
                  "time":[] }

for timeDataListObj in timeDataListArray:
  for timeDataObj in timeDataListObj.timedata:
    for timeDataType in timeDataTypes:
      timeDataFrame["utility"].append(timeDataListObj.utility)
      timeDataFrame["target"].append(timeDataListObj.target)
      timeDataFrame["build"].append(timeDataListObj.build)
      timeDataFrame["timeType"].append(timeDataType)
      timeDataFrame["time"].append(float(timeValue))

df = r.data_frame(timeDataFrame["utility"], \
             timeDataFrame["target"], \
             timeDataFrame["build"], \
             timeDataFrame["timeType"], \
             timeDataFrame["time"])
r.print_(r.is_data_frame(df))
==========

Another problem I have is with syntax. For example, how can I perform a column reference like df$target or df$timeType?

When I tried to do either:

r.print_(df$target)

or

r.print_(df+r['$']+target)

or

r['print(df$target)']

I get syntax errors. Same with r.split(df$target, df$build) and similar.

The problem seems to come down to how these are interpreted. Either Python misinterprets the r.print_() calls and complains about the $ reference, or when I use r['print(df$target)'], the R interpreter doesn't have any knowledge of the variable df and complains about non-existent variables.

Any advice from seasoned Python/R/RPy users would be greatly appreciated. Thanks!
posted by Blazecock Pileon to Computers & Internet (4 answers total)
 
Best answer: I've used RSRuby a little bit, which is based on RPy. As you're discovering, it's very messy.

In fact, I've found that 90% of the time, it's easiest just to export the data, invoke R, read in and manipulate the data, export back out of R, and then pull the results back into my script. Yes, it's unwieldy, but it's kept me sane.
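
For what it's worth, that round trip might look roughly like the sketch below. Everything in it is an assumption for illustration: the function names, the file names, the choice of averaging "time" per group, and the use of Rscript (which needs a newer R than the 2.4.1 in the question, and must be on the PATH). It is written for a current Python 3 rather than 2.3.4.

```python
import csv
import os
import subprocess
import tempfile


def build_r_script(in_path, out_path):
    """Return R code that reads a tab-delimited file, averages 'time' per
    utility/target/build group, and writes the result back out.

    The paths are pasted into R string literals as-is, so backslashes or
    quotes in them would need escaping first.
    """
    return (
        'df <- read.delim("%s")\n'
        'res <- aggregate(df["time"],\n'
        '                 by = df[c("utility", "target", "build")],\n'
        '                 FUN = mean)\n'
        'write.table(res, "%s", sep = "\\t", quote = FALSE, row.names = FALSE)\n'
        % (in_path, out_path)
    )


def run_r(script_path):
    """Run an R script in a separate process; raises if R exits non-zero."""
    subprocess.run(["Rscript", "--vanilla", script_path], check=True)


def read_table(path):
    """Read a tab-delimited file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def round_trip(in_path):
    """Write the R script to a temp dir, run it, and return R's output rows."""
    workdir = tempfile.mkdtemp()
    script_path = os.path.join(workdir, "analysis.R")
    out_path = os.path.join(workdir, "results.txt")
    with open(script_path, "w") as handle:
        handle.write(build_r_script(in_path, out_path))
    run_r(script_path)
    return read_table(out_path)
```

Calling round_trip("timings.txt") on an exported file (a hypothetical name) would hand back one dict per group, with every value as a string, so the "time" field still needs a float() on the Python side.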

If that's not an option, try searching the RPy Mailing List Archives.
posted by chrisamiller at 11:07 PM on February 25, 2008


I'm going to ditto chrisamiller. I have always found glue from one scripting language to another to be more trouble than it is worth, and much harder to use than reading and writing the data from disk. I'd be happy to share some good practices for exporting data from Python and reading into R, if you want the thread to diverge in that direction. I have been doing that ad infinitum for the last four years or so.

I haven't tried RPy, but it's sufficiently specific that I'm wondering if many MeFites have. So here are a couple of suggestions to help you debug:

What does R think df is? Try str(df), which does its own printing. Also print typeof(df), class(df), and mode(df). Yes, R has three different ways to describe what an object is.

As for your columns: in R, a data.frame is implemented as a list. So my intuition would be to access it as a list, with the column name as a quoted string: r.print_(df["target"]).

(General slightly off-topic tips: You can also get the objects in named R lists with the $ operator. Also, you do not need backslashes in Python to continue an expression that is in parentheses.)
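
To illustrate the backslash point, here is a tiny sketch. Note that data_frame below is a hypothetical stand-in that just returns its arguments, not RPy's r.data_frame, and the sample values are made up.

```python
def data_frame(*columns):
    """Hypothetical stand-in for r.data_frame; just returns its arguments."""
    return columns


# Made-up sample columns, only to have something to pass in.
columns = {"utility": ["ls"], "target": ["x86"], "time": [1.5]}

# No trailing backslashes: the open parenthesis keeps the expression going
# until the matching close parenthesis.
df = data_frame(columns["utility"],
                columns["target"],
                columns["time"])
```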
posted by grouse at 11:59 PM on February 25, 2008


Response by poster: "I'd be happy to share some good practices for exporting data from Python and reading into R, if you want the thread to diverge in that direction. I have been doing that ad infinitum for the last four years or so."

I'd be interested in advice here. I'm now scanning over my coworker's Python/R-ish scripts, and it looks like your and chrisamiller's advice to read and write data is his approach as well.

At this point, I think I'll try the csv module, export the CSV file to a temporary stub, and then call commands with r.command to import the file and plot its various data.

Thanks for the object advice — that'll definitely come in handy when debugging.
posted by Blazecock Pileon at 12:13 AM on February 26, 2008


Best answer: I recommend using tabdelim from my (self-link!) textinput package. It's just a wrapper around csv, but it makes it simpler to write tab-delimited output. import tabdelim and you're done. You can just run easy_install textinput if you have setuptools installed (highly recommended).

Anyway, then you can just produce your output with something like:

import sys

from tabdelim import DictWriter

COLNAMES = ["utility", "target", "build", "timeType", "time"]

writer = DictWriter(sys.stdout, COLNAMES)

for timeDataListObj in timeDataListArray:
    for timeDataObj in timeDataListObj.timedata:
        for timeDataType in timeDataTypes:
            row = dict(utility=timeDataListObj.utility,
                       target=timeDataListObj.target,
                       build=timeDataListObj.build,
                       timeType=timeDataType,
                       time=timeValue) # I assume timeValue is already a str

            writer.writerow(row)

In R, you can read the data file in with read.delim(filename). All the column names are taken care of for you with a minimum of fuss.
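
If you'd rather not install anything, roughly the same kind of file can probably be produced with the standard csv module directly. This is only a sketch under assumptions, not tabdelim's actual implementation: write_rows and the sample rows are made up for illustration, and it targets Python 3.

```python
import csv
import io

COLNAMES = ["utility", "target", "build", "timeType", "time"]


def write_rows(handle, rows):
    """Write a header line plus one tab-delimited line per row dict."""
    writer = csv.DictWriter(handle, COLNAMES, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data, only to show the shape of the output.
rows = [
    dict(utility="ls", target="x86", build="debug", timeType="real", time=1.25),
    dict(utility="ls", target="x86", build="debug", timeType="user", time=0.75),
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_rows(buffer, rows)
output = buffer.getvalue()
```

Swap the buffer for a file opened with newline="" and read.delim should pick up the header line as the column names, since it expects a header by default.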

If you have more complicated data structures or large files (hundreds of megabytes), then try PyTables with the hdf5 package for R. It works really well and is very fast.
posted by grouse at 1:42 AM on February 26, 2008

