How to deal with mixed granularity in a fact table?
March 22, 2011 3:58 PM   Subscribe

Data warehousing question: how to combine data that have slightly different granularities?

I'm trying to load data from several different, but very similar, data sources into my fact table. The data is student enrollment data for public schools. Some schools report enrollment for both male and female students, and some only report the total number for both. The question is: how can I get all of this data into the fact table with the same granularity. It seems I have a few choices:

1. Create a separate fact table with Sex as a dimension, keeping total data integrity, but losing the ability to view all of the data together at the lowest level of granularity (rollups will still work, since a total is a total). Prognosis: a pain, but essentially doable.

2. Omit the male/female split in the data and store only the total enrollment. Prognosis: bad, don't want to lose the dimension.

3. ???

The issue is that there are other dimensions where the same holds: some schools report demographic data such as ethnicity and some don't. Will I have to create separate fact tables for all of these instances? Or is there some way they can all live happily together?
posted by vraxoin to Computers & Internet (5 answers total) 1 user marked this as a favorite
I would go with option 1 - where your key analysis is on the total, but if needed, someone could take a sample size with the different dimensions as factors. Whether that analysis will be statistically significant is a different question altogether. My personal opinion would be to have all data, regardless of whether it is useful now.
posted by theobserver at 4:14 PM on March 22, 2011

I would add 'unreported' as a third value for the Sex dimension, and similarly for the other dimensions where it's an issue. It's fine to do a separate fact table if there were just going to be one, but each additional dimension that somebody doesn't report would double the number of fact tables you'd have to create, and that seems like it'd get unmanageable very quickly.
posted by inkyz at 4:25 PM on March 22, 2011 [3 favorites]

I'd make the sex dimension an integer, use 1 for male, 0 for female, and -1 for unreported. That should be easy to run stats on in future.

Of course that may be cause some head scratching for your successors, but you'll have that job forever, right?
posted by clarknova at 4:49 PM on March 22, 2011

Generally I'd handle this as a missing data problem. I'd keep the database at the smallest granularity of data you have, but also include columns for aggregate data, whether its supplied by you or the datasource directly. Then you end up with something like:

| Male | Female | Total |
| 33 | 45 | 78 |
| NULL | NULL | 62 |

What you do with missing data depends entirely on your uses. You may want to ignore those columns, you may want to estimate them, etc. All stats programs provide good mechanisms for missing data.

If you have reasonable data sizes, this won't be any problem and will give you the most flexibility for working with the data (say, if you're using R or something). If you have truly huge amounts of data, I'd be writing custom accessors for the data sources that would provide the same information without any data duplication or preprocessing (but likely caching anyway).
posted by devilsbrigade at 5:38 PM on March 22, 2011 [1 favorite]

Probably should mention: there's a gigantic amount of literature about missing data, and even some about differences in aggregation. See here,here, and here (unfortunately not free).
posted by devilsbrigade at 6:38 PM on March 22, 2011

« Older Where to stay between Chicago and Miami?   |   Books on child development? Newer »
This thread is closed to new comments.