Lots of fiddly and not-so-fiddly data for research: Is OLAP my answer?
June 11, 2007 9:16 AM

Is OLAP my answer? Someone on a stats listserv I subscribe to mentioned OLAP in passing as an approach for managing lots of smaller data tables in a sane/integrated way. I've tried googling around and reading the info on various OLAP/BI project pages. I've since looked briefly at the Pentaho suite (Mondrian, and what was Kettle et al), JasperIntelligence, SpagoBI, and Palo OLAP...briefly only because I'm not quite technically adept enough to see for sure how these types of applications work and whether they do what I want them to, given that I'm not coming from what I imagine a "typical" BI user's perspective/needs might be. At this point, my head is spinning a bit.

They all seem nifty enough on the surface, but I say that from a relatively uninformed perspective. I'm a researcher who tries to learn the tools I need to get by, not a true techwiz or database guru. I'm hoping to learn enough here to intelligently negotiate what I want with our tech folks, or at least be able to have a good conversation with them about which solution we might want to pursue. So...is OLAP for me? Or am I looking for something else? Any specific recommendations? (open source strongly preferred).

I'm a researcher with an organization that by necessity accumulates lots and lots of data ranging all the way from raw data tables (public use data as well as collected/created by us) to many fiddly highly-distilled cross-tabs produced by other organizations. Right now folks in our group find and accumulate data more-or-less individually, and we don't have a great system of central organization to stash it all, know what we've got, keep it clean, and quickly pull it into nice usable formats. Given the rate of expansion (of both our org and the data we need to manage!) I see this situation going from less-than-optimally-efficient to downright problematic in the nearish future. At minimum I want to have one "go-to" place for all of the fiddly tables, be able to document that data, and put it there and pull it back out easily. At best I'm hoping for something that can grow into serving for backend management of and frontend access and connectivity to all of the data we manage.

More detail on what I'd ideally like something to do:

1. Let us easily upload and define new tables of data with varying amounts of dimensionality and pre-aggregation (ranging from lots of dimensions highly aggregated all the way down to raw data in flat files). When I say "easy", I mean that I need to be able to train undergrads or grad students in non-technical fields to do this part: assume bright but inexperienced - hence one of the reasons why I'm not just looking at throwing everything into a big database. Should be able to take data from delimited text, Excel, and common database formats.

2. Associate an extensive, customized set of metadata with each table (doesn't necessarily need to be different schema per table, but we're going to have a lot of our own fields from a research perspective that go beyond what a typical BI application would care about). E.g., could easily be from 50 to more than 100 fields.

3. Search for data tables based on fields in the metadata schema.

4. Be able to both (a) create quick calculation/crosstab tables from various tables within the repository (singly or perhaps in combination) and output those in useful formats as well as (b) extract data tables in whole or in part through ODBC/JDBC connections and queries from within a statistical computing package like R or SAS for more in-depth analyses. I'm hoping the same kind of connectivity would make it easy for other web applications to access particular views of the data on the fly if that served our purposes down the line.

5. Super-bonus: be able to do the same as #4 but with the metadata - i.e., pull out associated metadata and combine it into a new metadata record to go with the resulting analysis.

6. Have fairly robust security, and multiple levels of it, so that different users (internal vs. external partners, staff vs. students, etc.) could have differential access to both particular data tables and what they can do with those tables.
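To give a concrete (if drastically simplified) sense of what items 1-3 amount to, here's a minimal sketch of a "go-to place plus searchable metadata" using nothing but SQLite. The table name, metadata fields, and sample data are all hypothetical stand-ins - a real repository would need the 50-100 metadata fields described above, plus the security and connectivity requirements:

```python
import sqlite3
import csv
import io

# One catalog row per data table; these metadata columns are hypothetical
# stand-ins for the much larger research-oriented schema described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE catalog (
        table_name TEXT PRIMARY KEY,
        source_org TEXT,      -- who produced the data
        topic      TEXT,      -- research topic keyword
        year       INTEGER,   -- reference year
        notes      TEXT
    )
""")

def load_delimited(name, text, source_org, topic, year, notes=""):
    """Load a delimited text table and register it in the catalog (item 1 + 2)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE "{name}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', data)
    conn.execute("INSERT INTO catalog VALUES (?, ?, ?, ?, ?)",
                 (name, source_org, topic, year, notes))

# A made-up cross-tab from another organization, pasted in as CSV.
load_delimited("unemployment_by_county", "county,rate\nA,4.2\nB,5.1",
               source_org="BLS", topic="employment", year=2006)

# Item 3: find tables by metadata, then pull the data itself back out.
hits = conn.execute(
    "SELECT table_name FROM catalog WHERE topic = ?", ("employment",)
).fetchall()
print(hits)  # [('unemployment_by_county',)]
```

Because the repository ends up as an ordinary SQL database, the ODBC/JDBC access in item 4 comes along more or less for free - R, SAS, and web applications can all query the same tables.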

Given all of this, what will get the job done for me? Also, are there considerations or features I should care about but don't know enough to yet?

posted by shelbaroo to Computers & Internet (1 answer total)
To answer you way too briefly: OLAP isn't the most suitable keyword search for your list of requirements. Given your demand for ease of metadata management, perhaps a 'warehouse builder' search would be helpful.

OLAP is chiefly good for rich, multi-dimensional analysis of data. Of the links you cited, Palo OLAP seems to best represent a typical OLAP server. Note that in both JasperIntelligence and the Pentaho BI suite, OLAP is just one part of the architecture diagram.
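To make "multi-dimensional analysis" concrete: an OLAP server lets you roll a fact table up along any combination of dimensions. A toy illustration in plain Python - the dimensions and figures are entirely invented, and a real OLAP engine does this at scale with pre-aggregation:

```python
from collections import defaultdict

# Toy fact table: (region, year, product, sales). All values invented.
facts = [
    ("east", 2006, "widgets", 100),
    ("east", 2007, "widgets", 120),
    ("west", 2006, "gadgets", 80),
    ("west", 2007, "widgets", 90),
]

def rollup(facts, *dims):
    """Sum sales along the named dimension positions (0=region, 1=year, 2=product)."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row[3]
    return dict(totals)

print(rollup(facts, 1))     # by year: {(2006,): 180, (2007,): 210}
print(rollup(facts, 0, 2))  # region x product cross-tab
```

That slice-and-aggregate operation is what OLAP optimizes; it doesn't address the metadata cataloging and ingestion workflow the question centers on, which is why a data-warehouse search is the better starting point.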
posted by em at 4:39 PM on June 11, 2007
