Debunking a Myth: Column-Stores vs. Indexes
Consider a traditional, row-oriented database. Indexes are known to improve query performance. They can greatly reduce I/O costs by avoiding the need to perform table scans since they directly contain the data you need to answer a query or contain pointers to such data. If you have a query that accesses only two out of thirty columns from a large table, and you have an index on these two columns, then you can use the indexes to avoid scanning all of the data in the table.
A challenge when using a traditional database is deciding what indexes to create on your tables. One either pays a DBA to carefully choose the right set of indexes to optimize a target workload, or you buy a database with an auto-tuning feature to create this set of indexes automatically (which might not be as good as a human DBA).
Ideally, it would be possible to have an index on every column. Unfortunately, every index you create results in the materialization of another copy of the column data (in addition to having other space overheads for pointers and other parts of the index data structure). Thus, the size of your database would be enormous if you had an index on every column. Even if you had infinite storage space (so that this explosion in data storage was not an issue), index maintenance is very expensive. Updates and inserts need to be reflected in the raw data and all of the indexes. Hence, there is a fundamental trade off - indexes improve query performance, but cost you in storage and maintenance. This is why you need an expert to choose the right set of indexes.
Now consider a column-store. By storing each column separately, the benefit appears similar to having an index on every column in a table in a row-store. If you have a query that accesses only two out of thirty columns from a large table, the column-store only reads those two columns and can avoid the enormous table scan (just like having an index). However, since it is the raw copy of the data that is stored in columns, no additional copies of the data need to be created, so the storage and update overheads associated with indexes is avoided.
Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).
Though this argument sounds reasonable, it is completely incorrect. It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store.
Assume the following situation:
a) You already have a license for a commercial row-store
b) You have tons of extra storage space
c) You have a read-only workload (so index maintenance is not an issue)
Using the above reasoning, in this situation you would not need to go out and buy a column-store. You would just create an index on every column on your row-store.
In our SIGMOD 2008 paper, "Column-Stores vs. Row-Stores: How Different Are They Really?" which we presented last month in Vancouver, we explored this situation, running a commercial row-store (with no storage restrictions) on a read-only benchmark. The benchmark we used was the Star Schema Benchmark, a recently proposed benchmark designed to be more "typical" of data warehousing data and queries than TPC-H. We compared the performance of the commercial row-store (where we created an index on every column and forced the database to always use these indexes to access data instead of using a full table scan to access data) with the same row-store under a more normal configuration (optimized by a professional DBA) and with a column-store. The results are shown in the figure below:
The fact that the column-store was almost a factor of six faster than the row-store was not surprising. After all, column-stores are supposed to outperform row-stores for data warehousing workloads. But if one views a column-store as similar to a row-store with an index on every column, one would have expected the row-store (all-indexes) approach to perform about as fast as the column-store. Instead, it performed over a factor of 50 slower, and almost an order of magnitude slower than the same commercial row-store that used full table scans to access data instead of index accesses!
So what's going on here? It turns out that a column in a column-store is very different from an index. A column in a column-store stores attribute data in the same order that it appeared in the original table (or from a sorted projection of that table). You can think of this as mapping tuple ID to column value. For example, as shown in Figures 2 - 4, if you want the value for the "customer city" attribute for the 6th tuple in a table (or projection), you can find this value by jumping to the 6th value in the "customer city" column. On the other hand, an index contains the exact opposite mapping. It maps a column value to tuple ID. If you want to find the tuple ID for all tuples whose "customer city" is "Denver", an index is great. But what if you want to find the "customer city" of the 6th tuple? You would have to scan the whole index, looking for tuple ID 6.
So indexes are often useful in first part of query execution where predicate evaluation occurs (dealing with the "WHERE" part of a SQL statement), where you are looking for tuples with specific values (it turns out that even then, indexes are only useful for very selective predicates). But for the later part of the query plan, where the database is extracting values for attributes for specific tuple IDs (the "SELECT" and "GROUP BY" part of a SQL statement), you want a tuple ID to value mapping, and a column is better than an index. The reason why the "row-store all-indexes" approach went so slow in our experiments is that for each tuple ID produced by evaluating the predicates in the SQL "WHERE" clause, the database would have to search the index (using the wrong mapping) for each attribute that appeared in the "GROUP BY" and "SELECT" clauses. This can be thought of as adding one additional join to the query for each attribute that appears in the "GROUP BY" and "SELECT" clauses.
Hence, an index and a column are quite different data structures. Of course, there are some situations where what you really want is an index and not a column. For example, if you had a query workload with a lot of "needle-in-the-haystack" queries (queries with very selective predicates), you need to use a lot of indexes. If you have the incorrect perception that a column-store is pretty much the same as a row-store with an index on every column, you might be tempted to use a column-store. In fact, what you really want is a heavily indexed database (either a row-store, an indexed column-store, or a column-store with multiple redundant sort orders).
An astute reader might ask the question: what if the row-store was able to have indexes that mapped tuple-ID to value instead of the other way around? We studied that idea too, and although this significantly improves the performance of the all-index approach, it still does not approach the performance of the column-store. We will explain why this is the case in a future blog post.
(Ed. This article was co-authored by Sam Madden)
A challenge when using a traditional database is deciding what indexes to create on your tables. One either pays a DBA to carefully choose the right set of indexes to optimize a target workload, or you buy a database with an auto-tuning feature to create this set of indexes automatically (which might not be as good as a human DBA).
Ideally, it would be possible to have an index on every column. Unfortunately, every index you create results in the materialization of another copy of the column data (in addition to having other space overheads for pointers and other parts of the index data structure). Thus, the size of your database would be enormous if you had an index on every column. Even if you had infinite storage space (so that this explosion in data storage was not an issue), index maintenance is very expensive. Updates and inserts need to be reflected in the raw data and all of the indexes. Hence, there is a fundamental trade off - indexes improve query performance, but cost you in storage and maintenance. This is why you need an expert to choose the right set of indexes.
Now consider a column-store. By storing each column separately, the benefit appears similar to having an index on every column in a table in a row-store. If you have a query that accesses only two out of thirty columns from a large table, the column-store only reads those two columns and can avoid the enormous table scan (just like having an index). However, since it is the raw copy of the data that is stored in columns, no additional copies of the data need to be created, so the storage and update overheads associated with indexes is avoided.
Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).
Though this argument sounds reasonable, it is completely incorrect. It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store.
Assume the following situation:
a) You already have a license for a commercial row-store
b) You have tons of extra storage space
c) You have a read-only workload (so index maintenance is not an issue)
Using the above reasoning, in this situation you would not need to go out and buy a column-store. You would just create an index on every column on your row-store.
In our SIGMOD 2008 paper, "Column-Stores vs. Row-Stores: How Different Are They Really?" which we presented last month in Vancouver, we explored this situation, running a commercial row-store (with no storage restrictions) on a read-only benchmark. The benchmark we used was the Star Schema Benchmark, a recently proposed benchmark designed to be more "typical" of data warehousing data and queries than TPC-H. We compared the performance of the commercial row-store (where we created an index on every column and forced the database to always use these indexes to access data instead of using a full table scan to access data) with the same row-store under a more normal configuration (optimized by a professional DBA) and with a column-store. The results are shown in the figure below:
The fact that the column-store was almost a factor of six faster than the row-store was not surprising. After all, column-stores are supposed to outperform row-stores for data warehousing workloads. But if one views a column-store as similar to a row-store with an index on every column, one would have expected the row-store (all-indexes) approach to perform about as fast as the column-store. Instead, it performed over a factor of 50 slower, and almost an order of magnitude slower than the same commercial row-store that used full table scans to access data instead of index accesses!
So what's going on here? It turns out that a column in a column-store is very different from an index. A column in a column-store stores attribute data in the same order that it appeared in the original table (or from a sorted projection of that table). You can think of this as mapping tuple ID to column value. For example, as shown in Figures 2 - 4, if you want the value for the "customer city" attribute for the 6th tuple in a table (or projection), you can find this value by jumping to the 6th value in the "customer city" column. On the other hand, an index contains the exact opposite mapping. It maps a column value to tuple ID. If you want to find the tuple ID for all tuples whose "customer city" is "Denver", an index is great. But what if you want to find the "customer city" of the 6th tuple? You would have to scan the whole index, looking for tuple ID 6.
So indexes are often useful in first part of query execution where predicate evaluation occurs (dealing with the "WHERE" part of a SQL statement), where you are looking for tuples with specific values (it turns out that even then, indexes are only useful for very selective predicates). But for the later part of the query plan, where the database is extracting values for attributes for specific tuple IDs (the "SELECT" and "GROUP BY" part of a SQL statement), you want a tuple ID to value mapping, and a column is better than an index. The reason why the "row-store all-indexes" approach went so slow in our experiments is that for each tuple ID produced by evaluating the predicates in the SQL "WHERE" clause, the database would have to search the index (using the wrong mapping) for each attribute that appeared in the "GROUP BY" and "SELECT" clauses. This can be thought of as adding one additional join to the query for each attribute that appears in the "GROUP BY" and "SELECT" clauses.
Hence, an index and a column are quite different data structures. Of course, there are some situations where what you really want is an index and not a column. For example, if you had a query workload with a lot of "needle-in-the-haystack" queries (queries with very selective predicates), you need to use a lot of indexes. If you have the incorrect perception that a column-store is pretty much the same as a row-store with an index on every column, you might be tempted to use a column-store. In fact, what you really want is a heavily indexed database (either a row-store, an indexed column-store, or a column-store with multiple redundant sort orders).
An astute reader might ask the question: what if the row-store was able to have indexes that mapped tuple-ID to value instead of the other way around? We studied that idea too, and although this significantly improves the performance of the all-index approach, it still does not approach the performance of the column-store. We will explain why this is the case in a future blog post.
(Ed. This article was co-authored by Sam Madden)
Categories
Database architecture , Database miscellaneous

Leave a comment