Stonebraker comments on OODB market failures, data warehouse pain, and column advantages
In my recent post, a reader made the following comments:
In my opinion, all three claims are false.
OODBs failed for other reasons than the inclusion of OO technology in RDBMS
First, OODB were designed and built for the engineering database market. The technology's main focus was on persistence of programming language objects and not on business data processing features such as a strong transaction system and SQL support. OODB vendors were unsuccessful in selling to this market for a variety of reasons -- reasons that are too lengthy to go into here. However, the main reason for their lack of market success was their inability to construct a value proposition with sufficient return on investment for the engineering customer. The demise of OODB has little to do with the inclusion of OO features in RDBMS, an effort that my Postgres system was in the forefront of.
Data warehouse pain is real
Second, I have talked to perhaps 50 CIOs and warehouse DBAs over the last year. Their pain is evident, and is due to:
As a result, the typical warehouse administrator is:
The success of Teradata, Netezza, and a collection of other hardware appliance vendors is proof-positive of extreme pain in this market. Organizations who are not feeling great pain would never bring such custom iron into their shop with its resulting hardware lock-in and high prices.
Column advantages are proven
Concerning the third point, Vertica has been "baked off" against row stores by more than 20 enterprises, each comparing us to their current row-store solution on a schema, typically optimized by the row-store vendor for their application. We have yet to win a benchmark by less than an order of magnitude. The reasons for this dramatic performance advantage are:
Summing it up
The commenter does make an accurate point about performance. There are two kinds of queries, repetitive queries, such as end of week sales reports, and ad hoc queries, such as "I wonder if" queries from business analysts.
One way or another (precube the data, use materialized views, or perform some other indexing technique) DBAs usually provide adequate performance for repetitive queries as the commenter notes. Hence, it is often the ad hoc queries that are giving DBAs the most pain. Because they are not known in advance, the precomputation tricks mentioned do not work.
As such, we see the most user pain around ad hoc queries and the inability of the row-store providers to scale up with data volumes.
In summary, column-store architectural features:
- OODBs failed because RDBMS vendors added OO functionality
- Data warehouse users are not in pain
- Column stores only beat row stores on poorly designed warehouse schemas
In my opinion, all three claims are false.
OODBs failed for other reasons than the inclusion of OO technology in RDBMS
First, OODB were designed and built for the engineering database market. The technology's main focus was on persistence of programming language objects and not on business data processing features such as a strong transaction system and SQL support. OODB vendors were unsuccessful in selling to this market for a variety of reasons -- reasons that are too lengthy to go into here. However, the main reason for their lack of market success was their inability to construct a value proposition with sufficient return on investment for the engineering customer. The demise of OODB has little to do with the inclusion of OO features in RDBMS, an effort that my Postgres system was in the forefront of.
Data warehouse pain is real
Second, I have talked to perhaps 50 CIOs and warehouse DBAs over the last year. Their pain is evident, and is due to:
- Warehouse size is going up rapidly, and query complexity goes up at about the square of database size
- Load volumes are going up rapidly, and load times increase linearly with volume
As a result, the typical warehouse administrator is:
- Saying "no" to ad-hoc query requests from business intelligence users
- Having increasing trouble loading data within the load window allotted
The success of Teradata, Netezza, and a collection of other hardware appliance vendors is proof-positive of extreme pain in this market. Organizations who are not feeling great pain would never bring such custom iron into their shop with its resulting hardware lock-in and high prices.
Column advantages are proven
Concerning the third point, Vertica has been "baked off" against row stores by more than 20 enterprises, each comparing us to their current row-store solution on a schema, typically optimized by the row-store vendor for their application. We have yet to win a benchmark by less than an order of magnitude. The reasons for this dramatic performance advantage are:
- Only reading the columns you need. We see fact tables with anywhere from 40 to 200 attributes. Warehouse queries typically read 5 or less columns. A column database reads exactly the columns needed; a row store reads all the columns. In round numbers this is an order of magnitude performance penalty.
- Superior compression. Columnar compression is more effective than the schemes used by row stores because blocks read by column databases have only one data type in them (a portion of a column) while row stores have several data types (a collection of tuples). Compressing one data type is fundamentally easier than compressing several. Moreover, in Vertica's case, it does not store an explicit tuple identifier or space-wasting bit maps of non-null fields. Hence, we typically see columnar compression beating row-store compression by a factor of 2.
- Executor runs on compressed data. The row stores uncompress a block when it is brought off the disk because they have legacy executors that deal with uncompressed data. In contrast, the Vertica executor runs on compressed data. This results in better L2 cache performance, copying less bytes of data, etc.
- Inner loop is column-oriented not row-oriented. A row-oriented query plan has an inner loop that picks up a tuple and does something with it. A column-oriented query plan has an inner loop that picks up a column and does something with it. In a fact table query, there might be 10 ** 9 rows but only 5 relevant columns. Hence, the inner loop, with its inherent overhead, is executed vastly less times in a column store.
Summing it up
The commenter does make an accurate point about performance. There are two kinds of queries, repetitive queries, such as end of week sales reports, and ad hoc queries, such as "I wonder if" queries from business analysts.
One way or another (precube the data, use materialized views, or perform some other indexing technique) DBAs usually provide adequate performance for repetitive queries as the commenter notes. Hence, it is often the ad hoc queries that are giving DBAs the most pain. Because they are not known in advance, the precomputation tricks mentioned do not work.
As such, we see the most user pain around ad hoc queries and the inability of the row-store providers to scale up with data volumes.
In summary, column-store architectural features:
- Result in dramatically better performance
- Cannot be copied by the row-store vendors -- you cannot be both a row store and a column store at the same time
Categories
Database architecture2 Comments
Leave a comment

What about in-memory databases? One model I have thought about is tokenization. I have not seen a database that does this. All of the unique values in a column are assigned an integer ID. Real-word data does not contain an infinite set of values, plus the set of unique values is never more than the number of rows in the table. Rows are then compressed down to the minimum number of bits needed to represent the tokens. Fifty states get six bits, etc. This model makes the tables much smaller, small enough to fit into RAM (you can put a lot of RAM on a 64b machine).
Now that the tables are in RAM and fixed length in row size, you can sort them very quickly. Queries would sort the actual tables. Mathematical joins of tables work because all the rows are fixed length. A foreign key lookup is an array access. SQL pattern matches are precomputed against the token maps and turned into arrays of corresponding token values.
The token mapping tables are not kept in RAM unless you have room. Newly created row values are assigned incrementally unique tokens. Of course if you overflow a field width everything needs to stop and be adjusted. Since tokens are assigned unique ascending values, query results can be post-processed to undo the tokenization without stopping other queries. Ordering of token assignments can help with comparison functions, just treat newly appended values as exceptions, you can always stop everything and renumber the tokens if there are too many new values.
It may even be possible to achieve transactions on this model by updating the in-memory copy directly and logging the transaction. The logs are then processed in another process and update the disk based copy of the tables for recovery.
In this model a one billion row table with 20 columns never takes more that (1 billion * 20 * 30 bits) = 75GB RAM. But real world data doesn't have unique data in every column. It is likely that a real world table would be 25GB or less. You can also eliminate columns that are never used in a query and recover them during the detokenization process.
Of the four bullet point arguments for Vertica that you make in your post, the two that seem suspect are the case for compression – not because compression won’t make some queries run faster, but because it seems setting up the compression does not dove tail with the goal fast of ad-hoc queries. (The remaining arguments are certainly tried and true – column-stores having been used by APL programmers for well over 30 years, with kdb being a current, successful commercial implementation.)
As I understand it, a set of one or more columns are specified as a projection. The rows of this projection are sorted on one or more of the columns, and the resulting columns are each individually compressed by say, run-length encoding. The primary column (first sorted) will get the best compression, while secondary and tertiary columns will do less well. Certainly there will tend to be few columns in a projection. This implies there will be many projections in a large table. Each projection must have an associated “index vector” stored (and updated) along with it. This index vector is an array of 32 or 64 bit integers that has as many items as there are rows in the table. The index vector maps the sorted projection to the original table, or to other projections. Restoring a compressed projection with this index vector is the cost of a join, and, in a column-store, it is identical to the cost of joining two tables by a foreign key. In fact, the “index vector” is a foreign key.
Take the limiting case of one projection for each column, and only one column in each projection. In this case, the storage requirements might well increase as the index vector may be larger than the original column. Furthermore, any useful query will require joining multiple projections, so the table would perhaps be better uncompressed.
Take the other end limiting case of one projection for each column, and all columns in each projection. This will also increase required storage, as the compression per column drops drastically after the first few. However, a where clause on the primary or secondary or tertiary sorted columns will be very fast, and to the extent that the query returns columns that are in the particular projection, it will be very fast.
Cleary Vertica intends the set of projections of a table to be somewhere in the middle. But doesn’t having to carefully specify and tune projections to the specific query requirements of an application somewhat conflict with the goal of making ad-hoc queries fast? Or do you see this being done automatically?