Datawarehouse 101
August 28, 2007 2:22 AM   Subscribe

I am changing jobs in a couple of months, and need to know more about database design, and more specifically datawarehouse and datamart for reporting systems.

I will be working on the ETL and reporting side, but would like to sound more knowledgeable about general database design techniques, such as normalisation, star schemas and so on, with particular emphasis on datawarehousing. The last time i looked at this Ralph Kimball was the datawarehouse reference point. Would it still be worth reading his Lifecycle Toolkit? Or is there somebody new out there that is a must-read? Online or hard-copy recommendations would be most welcome.
posted by jontyjago to Computers & Internet (2 answers total) 3 users marked this as a favorite
 
Learn by doing. Try designing a database to record information about songs, e.g., artists, publishers, albums, composers, compositions (the "ideal" songs) and their covers (the "instances"/tracks of songs). this quickly becomes non-trivial. Make sure you design it in at least 3rd normal form. Explore the controversy between using synthetic keys and natural keys by implementing two versions of your database, one with each.

Once you have your database, do some "warehousing": what's the median track count of songs per genre per decade? Which composers have the most covers (in unique tracks; don't count tracks twice, i.e,, that are republished on "Best Of" or compilation albums)? The most artists covering them (i.e., don't count a different version of the cover by the same artist)? Of artists in "bands" (i.e, , several artists together on a song, e.g, John, Paul, George & Ringo made up the Beatles), what percentage of artists have more than three solo albums in the decade after the band has disbanded? (Can you even determine when a band has disbanded by examining your database (this is a surprisingly subtle thing to map)?

Learn that, whether you're doing ETL, reporting, or warehousing, views are your friends. Get in the habit of writing views for anything you do; the view will serve to functionally decompose your queries, and will provide building blocks for more complex queries. "First, build a view."

Learn the value of indices, both for efficiency and for semantic correctness.

Finally, learn what I mean when I say, "If you lie to the database, the database will lie to you." Explore how poorly-conceived database structures can retain unimportant information, lose vital information, or worst, make wrong outputs out of corrrect inputs.
posted by orthogonality at 4:12 AM on August 28, 2007 [4 favorites]


I think Kimball is still one of the big guys in the data warehousing field (though I'd be inclined to recommend his Data Warehouse Toolkit before the Data Warehouse Lifecycle Toolkit). The other person to look at is Bill Inmon, and there's generally considered to be a rivalry between the Inmon and Kimball approaches. You can spark some, um, vibrant discussions by just mentioning the word "denormalization"...

One thing to keep in mind is that in a denormalized data warehouse, it's up to the ETL process to ensure data integrity, since you've kind of given up on having the database do that for you. In an ideal world (with truly relational database platforms and no performance concerns), there would be no need to do that sort of thing, but in reality it is sometimes necessary. But you need to know what you're getting yourself into.

In addition to the modelling issues that orthogonality mentions, you also need to look at optimizing for getting data out of the warehouse. That is, your schema needs to be set up in a way that allows reports on huge amounts of data to run in a reasonable amount of time. Queries can often be sped up by running them on precomputed aggregate data, but then that puts even more burden on your ETL process to ensure that the aggregate data is always identical to the non-aggregated data.

So I guess the point I'm trying to make here is that data warehouses have a whole mess of issues that are very different from those in OLTP systems, and the most important thing is probably to familiarize yourself with those differences (and the controversies surrounding how to deal with them). Implementation specifics are constantly changing (and the best place to learn about a lot of these is probably on the job anyway), but the main principles haven't really changed that much. So I'd say if you start with Kimball and Inmon (sorry, I don't know what to recommend by Inmon), that should give you a pretty good background.
posted by klausness at 5:14 PM on September 9, 2007


« Older Brew pubs in London?   |   What to do with a thieving adult child? Newer »
This thread is closed to new comments.