Reflections on the Kimball Data Warehouse "Bible": Time for a New Testament?

| | Comments (7)
While preparing for a seminar course I'm giving on Data Warehousing this Fall, I reviewed the The Data Warehouse Toolkit (2nd Ed) by Ralph Kimball and Margy Ross (the Amazon.com link is here). Long considered the Bible of data warehouse design, this book offers several schema design strategies for relational database OLTP engines used for data warehouse applications. The Data Warehouse Toolkit was a seminal contribution to the field of data warehousing when it was first published in 1996, differentiating the needs of warehouses from operational stores and inventing "dimensional modeling" as a schema design methodology. But in reading this text 10 years after it was first published, I found that it revealed OLTP database engines to be ill-suited to meet the needs of data warehousing applications.


Reevaluating the book ten years later

For those unfamiliar with it, I would summarize The Data Warehouse Toolkit as a collection of schema design strategies that fall in two categories:

  • Performance tuning strategies

  • Strategies designed to make desired warehouse queries expressible in SQL


"Dimensional Modeling" is an example from the first category and is the most emphasized design strategy. An alternative to standard normalization techniques taught in introductory database courses (e.g., 3NF, BCNF), dimensional modeling states that:

  • Dimension tables should be denormalized (i.e., prejoined with their subdimension tables) to allow common warehouse queries such as drilldown and rollup queries to be evaluated without having to perform additional joins.

  • Fact tables should be kept normalized so that they are "skinny," therefore requiring fewer blocks on disk (and by implication, fewer block reads when scanned during queries).


Design strategies from the second category include:

  • Defining Date as a dimension table rather than using SQL's built-in Date type (this allows queries to invoke a richer set of operations on Date than is provided by the SQL Date type, with these operations manifested as columns in the Date dimension table, such as whether or not a given date is a holiday).

  • Adding additional columns or rows to a dimension table to capture the history of a how a particular entity has changed over time (This allows queries to separate data according to this change, such as analyzing sales at a store before and after it changed its location).


Reading this book through my Vertica/C-Store prism, I had the following reactions which I contend reveal the deficiencies of a traditional relational row-store OLTP engine when used for data warehouse applications:

  1. Confusing logical and physical schemas

  2. Dimensional modeling is a row-store methodology

  3. Traditional DBMSs have poor support for abstract data types

  4. Traditional DBMSs have poor support for time travel


Confusing logical and physical schemas 

Data independence is the well-known principle that establishes that logical and physical representations of data are independent of one another. As a result, a given query over a logically defined schema (CREATE TABLE) returns the same answer, regardless of the availability of indexes, partitioning strategies, data layout on disk blocks, etc.

Beyond indexing, partitioning and data layout, an equally valid application of data independence says that a table's physical representation does not have to match its logical definition. To this end, denormalization (being a performance tuning strategy) should be a physical schema design strategy and not manifested in the logical table definitions.

This argues for DBMS tools that accept normalized logical schema definitions and characterizations of expected workload (e.g., sets of queries) as inputs and produce appropriate physical schemas as output. Automatic database design tools are used by SQL Server, DB2, and Oracle to assist in the selection of indexes and materialized views. Vertica's DB Designer tool takes this one step further to produce physical schemas, which often are selectively denormalized versions of their logical equivalents. It is clear to me that such a design tool is required of any DBMS used for data warehouse applications.


Dimensional modeling is a row-store methodology


One argument that favors the separation of logical and physical schema design comes from dimensional modeling. Dimensional modeling favors denormalization of dimension tables, but not fact tables. This distinction is made to ensure that fact tables are kept "skinny," because otherwise the fact table scans that are required in most data warehouse queries become too I/O intensive.

Note that this argument is specific to row stores. In a column stores, adding columns to a fact table simply requires adding additional files representing those columns. Any queries that do not use these columns need not touch their associated files. In other words, denormalization of a fact table is free in a column store, whereas it incurs significant I/O costs in a row store.

This distinction gives column stores a tremendous advantage over row stores. Denormalizing dimension tables reduces the number of joins that must be performed to answer a query, but the joins that are eliminated are the cheaper ones involving smaller (dimension and subdimension) tables. The most expensive joins will typically be those involving the fact table, and so elimination of these joins by selective denormalization of the fact table is a huge performance win.


Traditional DBMSs have poor support for abstract data types


The modeling of Date as a dimension table hits me as a workaround for the absence of an effective and efficient means of defining abstract data types in most DBMSs. The forerunner of object-oriented language "classes," an Abstract Data Type (ADT) is a type specification that includes definitions of operations over instances of that type. Given an ADT mechanism, a rich date type could be defined with operations that determine whether or not a given instance is a weekday, holiday, etc. Indeed, ADT's would be useful to meet the needs of any data warehouse application requiring richer data types than are provided by an OLTP engine.


Traditional DBMSs have poor support for time travel


The requirement to compare aggregate results before and after a change to an entity argues for a data warehouse solution to support "time travel." Oracle, for example, has a "FLASHBACK" option that allows a database to be queried as of a previous point in time. This is effectively supported by applying the UNDO log to a database until the database is rolled back to an earlier state upon which the historical query can be issued. Vertica provides equivalent functionality that depends upon an architecture that doesn't overwrite data in response to an update. Note that both of these solutions do not require altering the logical schema as is advocated in The Data Warehouse Toolkit. Rather, support for time travel is provided "under the hood" and made transparent to the application.


Summary

In short, The Data Warehouse Toolkit by Kimball and Ross offers valuable insights into how to squeeze data warehouse functionality out of an OLTP engine. In this respect, it was pioneering work in data warehousing. But that it advocates that DBAs express these strategies in their logical schema designs reflects the fact that the engines that DBAs are tuning were originally designed and built for transaction processing and not data warehousing. By using an engine built from the ground up for warehousing needs, DBA's can get significantly better performance while expending far less of their resources.
 

 

Categories

7 Comments

N Campbell said:

I would be interested to see what else Vertica is doing to enrich SQL to better support BI queries. For example, some of the issues identified by ORACLE that resulted in them introducing the model and group-by-outer clause etc.

David Docetad said:

"In other words, denormalization of a fact table is free in a column store, whereas it incurs significant I/O costs in a row store."

Well, it may have a different cost in a column store, but it's hardly "free". There are, at least, storage costs and update costs.

Mitch Cherniack said:

Thanks, David, I should have been more clear. The benefit of denormalizing the fact table in a column store (as opposed to a row store) is that it adds no I/O cost to fact table scans for queries that don't touch the added columns, while still saving on join costs for the queries that do. Of course, by adding data redundancy, denormalization can result in added update and storage costs regardless of the underlying storage architecture.

adolf garlic said:

The second edition was published in 2002. Hardly ten years old.

There is also a Microsoft specific book available, published in 2006

The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset
John Wiley & Sons, 2006 (720 pages)
http://www.amazon.com/exec/obidos/tg/detail/-/0471267155/qid=1128605285/sr=2-1/ref=pd_bbs_b_2_1/104-0775321-0612745?v=glance&s=books

anonymous said:

Can one of the contributers provide any disadvantages of using a column based database?

So far, the only one I've come across is that writing complete rows of data is slower given that data needs to be appended to 5 columns individually (i.e. 5 files).

Surely this would affect load times during already constricted windows?

Thanks

mark said:

I agree that dimensional modeling is/was largely driven by the row-oriented nature of databases, but I think you missed several items in this assessment.

Performance and SQL expressibility are two design components but you missed the third: schema legibility. In the era of SQL-based tools, user legibility is important for both end users and report developers. A star schema means there is only one way to pose a question, with no cyclic joins or confusion over where to go to constrain (dims) or get measures (fact tables).

The primary reason to not denormalize the fact table isn't to keep the fact table skinny, though that does help performance. The primary reason is granularity and isolating the fact row to a single event or transaction with only its related set of measures.
Given a 1-N relationship from the fact to all dims avoids confusion around double counting as well as optimizing for join and retrieval performance.
Separating the dims and facts also creates logical isolation of data, simplifying the ETL process.
It sounds as if you are advocating a big wide table model using a column store, rather than a star (I'm ignoring the normalized vs. star debate here). Is that true?

I think you misunderstand an aspect of time dimensions as well. From a straight calendar perspective you're right, in that better ADT features would allow access to all the attributes from a date column.
A bigger problem is that businesses don't normally operate on a regular monthly calendar. They operate on fiscal 445 (or similar) calendars, starting their year on often manual/arbitrarily chosen dates, and must track not only 445 and 13 week periods driven by the start date, but also deal with leap and 53rd week corrections and the occasional "finance wants to reset the calendar" tasks.
That's a pretty hairy problem to address in an ADT.

Don McMunn said:

Among other things, I have been waiting for abstraction of built-in data types of all kinds in major vendors. With the huge, rich data types so prevalent in our world today represented by video, audio, GIS, millibar maps, text, etc... you'd think we could manage to have vendor JV's and/or alliances with enough resources and clout to make more OO features available from our COTS relational database products. Capitalism and greed continue to rule the day and shareholders win, but practitioners lose.

Along with gaining useful access to database optimizer query history profiling metadata, I agree that a more highly performant, more granular, tunable and configurable OO extension for SQL is over 15 years late. Maybe it is time for a change, but I've gotten too old and slow to make it happen at this point, so that's the new generation's problem to solve now.

In the interest of fighting capitalism, I have made my date dimension toolkit free and downloadable at: http://www.ipcdesigns.com/dim_date along with a few other tidbits from 10+ years of making BI and DW "happen" in the field for my Fortune 500 customers. I can't change the vendors, but I can help those of us who remain in the field doing the day to day while all the "thought leaders" blow smoke up each other's skirt. Data warehousing 2.0? Most (thankfully not all) of the businesses I have worked for can barely get clean data from their own internal systems and are still fighting spreadmarts like it is still 1995! Cynical? Certainly not! Keeping it Real in Tennessee, Peace!

Leave a comment

About this Post

This page contains a single post by Mitch Cherniack published on October 3, 2007 9:58 AM.

Disk trends will drive the need for column stores was the previous entry in this blog.

Thoughts on the VLDB conference is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.