Follow up on compression post: Columns, indices, and sorting
Earlier this week I wrote about the advantages of compression in column-oriented databases (read the post here). A reader had questions about an example I used and the issue of sorting and indexes. I thought the commenter's points and questions were worth exploring in some depth.
Keeping columns in order
The commenter's key question was really about maintaining the correspondence between columns in a column-oriented database. The issue isn't really "getting the data back in the original order" as much as it is ensuring that the system has a way of accessing records in other columns that correspond to a particular record in a sorted column. I'll explain the details of how this works.
The simplest way to do this is just store one copy of the table sorted on one or more attributes. For example, suppose we have the following table sorted on state:
We can physically represent this as three columns each stored in the order shown in a different file. We can RLE encode the state column to reduce its space, and, if we secondarily sort the table on salaries, can probably also benefit from delta encoding salaries and dictionary encoding names. Any query that includes a predicate over state (e.g., state = 'ca') can be evaluated quickly because we can figure out what positions (e.g., 1 and 2) in the three columns need to be looked up to find matching records without decompressing the state column or looking at all of records of the other two columns.
Of course, we need to be able to efficiently perform this look up of a given set of positions in a column (e.g., finding the names and salaries that correspond to positions 1 and 2 in the state column). We can do this much faster than a conventional join would (which is what the commenter's foreign key suggestion would require) because the name and salary columns are stored in the same order as the state column (so the 1st state corresponds to the 1st entry in the name file, etc.). This allows us to either:
Notice that the positions will be accessed sequentially within each column file, which further improves performance (especially if a range of consecutive positions are accessed).
That said, the commenter was correct in that in this example we will really only get the maximum benefit of RLE on the state column with (if we sort them at all) a potentially decreased benefit for a secondary or tertiary sort on name or salary.
One way to address this is to replicate subsets of columns, sorting each subset in a different order. We call such subsets "projections" (where the columns in a projection are still each stored in separate files). For example, a user might create one projection with the above three columns sorted in state order and another projection with just name and salary sorted in salary order (which will compress well using delta or RLE encoding). That way, if a query wants to find employees in a particular salary range, it can do that quickly using this second projection. Of course, for a query to be able to use a particular projection, that projection must have the columns used in the query (or the system must run an expensive join operation to glue together two projections that are sorted in different orders).
Future post: Projections to maximize query performance
This, of course, presents an interesting physical database design question: What projections should the user (or administrator) create to maximize query performance for a given query workload? This is the idea behind the automatic database design features in C-Store and Vertica, which will be the subject of a future post.
Keeping columns in order
The commenter's key question was really about maintaining the correspondence between columns in a column-oriented database. The issue isn't really "getting the data back in the original order" as much as it is ensuring that the system has a way of accessing records in other columns that correspond to a particular record in a sorted column. I'll explain the details of how this works.
The simplest way to do this is just store one copy of the table sorted on one or more attributes. For example, suppose we have the following table sorted on state:
| name | state | salary |
| joe | ca | 10k |
| dave | ca | 55k |
| mary | ma | 50k |
| nancy | ma | 65k |
| mike | nv | 30k |
We can physically represent this as three columns each stored in the order shown in a different file. We can RLE encode the state column to reduce its space, and, if we secondarily sort the table on salaries, can probably also benefit from delta encoding salaries and dictionary encoding names. Any query that includes a predicate over state (e.g., state = 'ca') can be evaluated quickly because we can figure out what positions (e.g., 1 and 2) in the three columns need to be looked up to find matching records without decompressing the state column or looking at all of records of the other two columns.
Of course, we need to be able to efficiently perform this look up of a given set of positions in a column (e.g., finding the names and salaries that correspond to positions 1 and 2 in the state column). We can do this much faster than a conventional join would (which is what the commenter's foreign key suggestion would require) because the name and salary columns are stored in the same order as the state column (so the 1st state corresponds to the 1st entry in the name file, etc.). This allows us to either:
- Directly offset to the locations in the columns where a particular position is stored, if the column is uncompressed and fields are fixed length (e.g., each record is the same size). This would be the case for salary in the previous example if it were uncompressed, or, if each record is variable-sized, or the column is compressed (as would be the case if name is stored as a text field, for example) such that we can't apply direct offsetting.
- Use a sparse index to map positions to entries in a column. A sparse index is simply an index with one entry per *page* of the column file where an entry indicates the range of positions on the corresponding page. So if the name column has 10,000 names stored on 100 pages, there would 100 entries in the sparse index. To look up a given position, the system searches the index to quickly find the pages that contain the positions of interest. Because these sparse indices are very compact, they can usually be stored in memory or on a single disk page, and this extra lookup imposes very little overhead (unlike a traditional database index which can be quite large and require a significant number of I/Os to query).
Notice that the positions will be accessed sequentially within each column file, which further improves performance (especially if a range of consecutive positions are accessed).
That said, the commenter was correct in that in this example we will really only get the maximum benefit of RLE on the state column with (if we sort them at all) a potentially decreased benefit for a secondary or tertiary sort on name or salary.
One way to address this is to replicate subsets of columns, sorting each subset in a different order. We call such subsets "projections" (where the columns in a projection are still each stored in separate files). For example, a user might create one projection with the above three columns sorted in state order and another projection with just name and salary sorted in salary order (which will compress well using delta or RLE encoding). That way, if a query wants to find employees in a particular salary range, it can do that quickly using this second projection. Of course, for a query to be able to use a particular projection, that projection must have the columns used in the query (or the system must run an expensive join operation to glue together two projections that are sorted in different orders).
Future post: Projections to maximize query performance
This, of course, presents an interesting physical database design question: What projections should the user (or administrator) create to maximize query performance for a given query workload? This is the idea behind the automatic database design features in C-Store and Vertica, which will be the subject of a future post.
Categories
Database architecture1 TrackBacks
Listed below are links to blogs that reference this entry: Follow up on compression post: Columns, indices, and sorting.
TrackBack URL for this entry: http://www.databasecolumn.com/blog/mt-tb.cgi/13
» Relational database pioneer doesn't quite say technology is obsolete! from David Portas' Blog
According to Computerworld , Mike Stonebraker says "RDBMSes, are 'long in the tooth' and Read More
1 Comments
Leave a comment

Hi, and thanks for the additional post on this topic.
“Of course, for a query to be able to use a particular projection, that projection must have the columns used in the query (or the system must run an expensive join operation to glue together two projections that are sorted in different orders).”
This is what I was getting at, and seems a very important point. It seems you need many “projections” to really get the benefit of compression, yet each projection needs to have many columns in it in order to make it useful on its own for querying. Stated differently, one projection with all columns only provides good compression on the primary sort column, and is useful for querying, while many projections with a few columns each will provide good compression for as many columns as there are projections, but only very specific queries will work without expensive joins to “glue” the projections together. On the other hand, if you put too many columns into each projection, you may end up using more space than the original table. It seems the benefits (of compression and projections) are highly sensitive to the number of columns in a table and the types of queries that may be applied. (Looking forward to the next promised post on how to design the projections)
I see how it may be useful for very specific types of problems, but not as a general purpose RDMBS, or even a general OLAP tool.
Also, doesn’t having one column in many projections make update operations slower and more difficult?