Understanding the Difference Between Column-Stores and OLAP Data Cubes
Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP.) These workloads are characterized by queries that select a subset of tuples, and then aggregate and group along one or more dimensions. For example, in a sales database, one might wish to find the sales of technology products by month and store--the SQL query to do this would look like:
In this post, we study how column-stores and data cubes would evaluate this query on a sample database:
Column Store Analysis
In column-stores, this query would be answered by scanning the productType column of the products table to find the ids that have type technology. These ids would then be used to filter the productID column of the sales table to find positions of records with the appropriate product type. Finally, these positions would be used to select data from the months and stores columns for input into the GROUP BY operator. Unlike in a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases that use row-based technology.
Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained both from compression and the ability to directly offset to ranges of satisfying tuples. For example, notice that the sales table is sorted on productID, then month, then storeID. Here, all of the records for a given productID are co-located, so the extraction of matching productIDs can be done very quickly using binary search or a sparse index that gives the first record of each distinct productID. Furthermore, the productID column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space (see my previous post on compression in column-stores). Run-length encoding will also be effective on the month and storeID columns, since for a group of records representing a specific productID, month is sorted, and for a group of records representing a given (productID,month) pair, storeID is sorted. For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months and stores, then the productID column can be stored in 1,000 records (one entry per product), the month column can be stored in 1,000 x 12 = 12,000 records, and the storeID column can be stored in and 1,000 x 12 x 10 = 120,000 records. This compression means that less the amount of data read from disk is less than 5% of its uncompressed size.
Data Cube Analysis

Data cube-based solutions (sometimes referred to as MOLAP systems for "multidimensional online analytical processing"), are represented by commercial products such as EssBase. They store data in array-like structures, where the dimensions of the array represent columns of the underlying tables, and the values of the cells represent pre-computed aggregates over the data. A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above. Here, the cube includes "roll-up" cells that summarize the values of the cells in the same row, column, or "stack" (x,y position.) If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, the cells of this cube would look like:
Here, each cell contains the count of the number of records with a given (productID,month,storeID) value. For example, there is one record with storeID=1, productID=2, and month=April. The "sum" fields indicate the values of the COUNT "rolled up" on specific dimensions; for example, looking at the lower left hand corner of the cube for Store 1, we can see that in storeID 1, productID 1 was sold twice across all months. Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology products--this is indicated by their dark shading in the above figure.) Then, the system reads the pre-aggregated values from sum fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in Feburary and 1 in June, and that store 3 had 1 technology sale in February and 1 in October.
The advantages of a data cube should be clear--it contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries. It can be used to efficiently compute a hierarchy of aggregates--for example, the sum columns in the above cube make it is very fast to compute the number of sales in a given month across all stores, or the number of sales or a particular product across the entire year in a given store. Because the data is stored in an array-structure, and each element is the same size, direct offsetting to particular values may be possible. However, data cubes have several limitations:
Data cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed. They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed.
In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!) However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column store.
Finally, it is worth noting that there is no reason that cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system. That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear if the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world. Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.
SELECT month, store, COUNT(*)
FROM sales, products
WHERE productType = 'technology'
AND products.id = sales.productID
GROUP BY month, store
In this post, we study how column-stores and data cubes would evaluate this query on a sample database:
Column Store AnalysisIn column-stores, this query would be answered by scanning the productType column of the products table to find the ids that have type technology. These ids would then be used to filter the productID column of the sales table to find positions of records with the appropriate product type. Finally, these positions would be used to select data from the months and stores columns for input into the GROUP BY operator. Unlike in a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases that use row-based technology.
Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained both from compression and the ability to directly offset to ranges of satisfying tuples. For example, notice that the sales table is sorted on productID, then month, then storeID. Here, all of the records for a given productID are co-located, so the extraction of matching productIDs can be done very quickly using binary search or a sparse index that gives the first record of each distinct productID. Furthermore, the productID column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space (see my previous post on compression in column-stores). Run-length encoding will also be effective on the month and storeID columns, since for a group of records representing a specific productID, month is sorted, and for a group of records representing a given (productID,month) pair, storeID is sorted. For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months and stores, then the productID column can be stored in 1,000 records (one entry per product), the month column can be stored in 1,000 x 12 = 12,000 records, and the storeID column can be stored in and 1,000 x 12 x 10 = 120,000 records. This compression means that less the amount of data read from disk is less than 5% of its uncompressed size.
Data Cube Analysis

Data cube-based solutions (sometimes referred to as MOLAP systems for "multidimensional online analytical processing"), are represented by commercial products such as EssBase. They store data in array-like structures, where the dimensions of the array represent columns of the underlying tables, and the values of the cells represent pre-computed aggregates over the data. A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above. Here, the cube includes "roll-up" cells that summarize the values of the cells in the same row, column, or "stack" (x,y position.) If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, the cells of this cube would look like:
Here, each cell contains the count of the number of records with a given (productID,month,storeID) value. For example, there is one record with storeID=1, productID=2, and month=April. The "sum" fields indicate the values of the COUNT "rolled up" on specific dimensions; for example, looking at the lower left hand corner of the cube for Store 1, we can see that in storeID 1, productID 1 was sold twice across all months. Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology products--this is indicated by their dark shading in the above figure.) Then, the system reads the pre-aggregated values from sum fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in Feburary and 1 in June, and that store 3 had 1 technology sale in February and 1 in October.
The advantages of a data cube should be clear--it contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries. It can be used to efficiently compute a hierarchy of aggregates--for example, the sum columns in the above cube make it is very fast to compute the number of sales in a given month across all stores, or the number of sales or a particular product across the entire year in a given store. Because the data is stored in an array-structure, and each element is the same size, direct offsetting to particular values may be possible. However, data cubes have several limitations:
- Sparsity: Looking at the above cube, most of the cells are empty. This is not simply an artifact our sample data set being small--the number of cells in a cube is the product of the cardinalities of the dimensions in the cube. Our 3D cube with 10 stores and 1,000 products would have 120,000 cells, and adding a fourth dimension, such as customerID (with, say, 10,000 values), would cause the number of cells to balloon to 1.2 billion! Such high dimensionality cubes cannot be stored without compression. Unfortunately, compression can limit performance somewhat, as direct offsetting is no longer possible. For example, a common technique is to store them as a table with the values and positions of the non-empty cells, resulting in an implementation much like a row-oriented relational database!
- Inflexible, Limited ad-hoc query support: Data cubes work great when a cube aggregated on the dimensions of interest and using the desired aggregation functions is available. Consider, however, what happens in the above example if the user wants to compute the average sale price rather than the count of sales, or if the user wants to include aggregates on customerID in addition to the other attributes. If no cube is available, the user has no choice but to fall back to queries on an underlying relational system. Furthermore, if the user wants to drill down into the underlying data--asking, for example "who was the customer who bought a technology product at store 2 in February?"--the cube cannot be used (one could imagine storing entire tuples, or pointers to tuples, in the cells of a cube, but like sparse representations, this significantly complicates the representation of a cube and can lead to storage space explosions.) To deal with these limitations, some cube systems support what is called "HOLAP" or "hybrid online analytical processing", where they will automatically redirect queries that cannot be answered with cubes to a relational system, but such queries run as fast as whatever relational system executes them.
- Long load times: Computing a cube requires a complex aggregate query over all of the data in a warehouse (essentially, every record has to be read from the database.) Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries.
Data cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed. They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed.
In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!) However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column store.
Finally, it is worth noting that there is no reason that cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system. That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear if the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world. Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.
Categories
Database architecture6 Comments
Leave a comment

Dear Sam,
Making comparative analysis between Columnar STORE and OLAP is something I always wanted to read about.
I understand advantages of a terrific performance of retrieving sorted and contiguous data. What leaves me wondering is whether comparison with MOLAP is sufficient.
Non-relationally implemented ROLAP(usually via enhanced R-Trees) is carrying a great promise. One of the most well known ROLAP engines within SIGMOD academia community, is called DWARF (http://www.cs.umd.edu/~nick/projects/Dwarf.html).
DWARF stores precalculated materialized views, which are pre-sorted. Which makes me not seeing an advantage to the scalability of Analytical Datawarehouse being brought by Columnar Store.
I am issue there is something I am not seeing, which makes HOLAP approach (Metadata Repository in DWARF and Data Repository in Columnar Store) most advatageous.
If you could provide comparison of Columnar Store against DWARF ROLAP, that would be greatly appreciated. I mean not from benchmarking point view, but rather through comparison of core technologies by pointing onto the pros and cons of each to get a better feel of when using DWARF ROLAP as a storage for Metadata and Data Repositories is advantageous to the HOLAP approach.
Kindest regards,
Arkady G.
[ed. abbreviated commenter's name and removed email address]
This article is up to date as far as the column store developments but outdated on data cubes. It ignores the work on highly compressed Dwarf data cubes for which load time and size have been shown to be efficient both experimentally and analytically http://www.cs.umd.edu/~nick/projects/Dwarf.html. Being a co-author in the Dwarf technology, I would really like to see a deeper comparison and a benchmark of Dwarf and column store storage. Both eliminate prefix redundancy. Dwarf also eliminates suffix redundancy induced by the sparsity of the 2^D sub-groupings of the cube but not clear if it is applicable to column store. It would be also very useful to investigate the support of hierarchical dimensions in column stores and see how it compares to Dwarf’s. Nick R., University of Maryland [ed. abbreviated commenter's name and removed email address]
Nick -- I agree that a comparison with Dwarf cubes would be very
interesting, particularly if we could set up some actual experimental
benchmarks. Off the top of my head, it looks like Dwarf cubes
overcome problems with high dimensionality cubes, but they still have
to be loaded outside of the database system and lack the full
expressiveness of SQL.
Nick, Sam:
That study exists.
We recently re-examined Dwarf indexes to better understand their performance characteristics. Some of our findings were that Dwarf indexes are very sensitive to skew and high dimensions. We also looked back at experiments presented by the inventors in SIGMOD 2002. However, some of the results could not be reproduced.
In terms of "compression" in the context of Dwarfs it is worth noting that in the Dwarf paper compression is measured w.r.t. the fully materialized cube -- however, nobody would ever materialize the full cube.
The measure we examined instead was the size of a Dwarf index when compared to the original fact table size. For some data sets we received Dwarfs 5,000 times bigger than the original fact table --- which we think is an impressive number. This had of course impact on index build time/storage requirements but also query performance.
In addition, we compared Dwarf query performance against row-scans and column-scans. We found that although for some scenarios a Dwarf may be competitive, for several scenarios (especially so-called coarse-grained Dwarfs) Dwarfs will be beaten by a simple column-scan. For which scenario a Dwarf wins seems to be hard to predict.
Overall, we think that Dwarfs will only pay-off for very special cases. Moreover, the technique "as-is" requires careful extra mechanisms to make it robust and to be useful in practice (to avoid unexpectedly large index sizes and index build times).
I strongly believe that the current developments in hardware will make these kind of structures more and more obsolete.
You will find all details in our paper:
Dwarfs in the Rearview Mirror: How Big are they Really?
VLDB 2008: to appear, Auckland, New Zealand.
Best regards,
Jens, Marcos
Jens, Marcos -- Thanks very much for the feedback.
We'll take a look at your paper.
You can see many articles related to column oriented databases that have sprung up in the past few months. This is a hot topic because in a few years the current database technology will have a very difficult time to deal with this amount of data. We've been using Sybase IQ for sometime now to handle several billion new records a day and analysis of 50 TB online. Without this database it would have been impossible to run the analysis that we run, so I'm not sure if it fits in every situation, and I'm not sure IQ is the only answer, but the technology is incredible.
Dave