CPU trends, like disk trends, will favor adoption of column stores

| | Comments (5)
In a recent post, we discussed how mass storage technology trends favor the use of a column-store architecture in database systems designed primarily for processing decision support queries. In this post, Sam Madden and I reflect on why CPU trends may have a similar influence on database design choice.


Slotted pages in row stores can slow down CPU performance

Most row-store architectures use the "slotted pages" concept to place records on disk pages. In addition to some control information, a slotted page can be roughly divided into two major areas as shown in the diagram.

rowstores.jpg The first area, which generally grows from the front of the page, is used to hold tuples packed tightly one after another on the page. The second area, termed the "slot array," starts at the end of the page and grows toward the front. There is one entry in this array for each tuple on the page. Each entry consists of the offset of the corresponding tuple from the start of the page and the tuple's length.

With this layout scheme tuples are addressed by a triple consisting of a file or volume identifier, a page number within the file/volume, and a slot number. Such addresses, commonly referred to as TIDs (for tuple ID) or RIDs (for record ID), provide a level of indirection that turns out to be quite important. In particular, tuples can be put in a different location on the data page while retaining the same slot ID. This means it is possible to accommodate insertions, deletions, and updates without affecting indices with entries pointing to tuples on the page.

Now consider what happens as a selection predicate is applied to a file of tuples. Two iterators are used. The first iterator iterates through all the pages in the file. For each page returned by this iterator, a second iterator, which encapsulates the page logic, is initialized and then called once for each record on the page. Each time it is called it advances to the "next" location in the slot array to obtain the offset of the corresponding tuple on the page. 
Once the offset of the start of the tuple is obtained, the database system's tuple-handling logic computes the offset of the attribute to which the predicate is being applied. These two offsets are then added to the address of the start of the page in memory to get the address of the attribute. Once this address has been computed, the predicate can be applied.

In this process, at least two memory accesses per tuple are performed -- one to access the slot array entry and one to access the attribute to perform the selection predicate. For modern CPUs, a critical factor governing how fast an application runs is how often a memory access results in a L2 (data) cache miss. Such misses generally cause modern CPUs to stall for 100s of cycles (waiting for data to be transferred from the main memory into the cache), giving the application, in this case the database system, the appearance that it is running on megahertz -- not a gigahertz -- CPU.

Cache lines on today's CPUs are typically either 64 bytes (Intel Core 2 Duo) or 128 bytes (Intel Xeon for both L2 and L3 caches). Assume the computer's cache line is 64 bytes. Assuming the cache is "cold" -- that is, we haven't recently accessed the current page -- 4-byte slot array entries (2 bytes each of offset and length) will incur an L2 cache miss on every 16th access to the slot array, causing another 64 bytes of memory to be read into the cache. On the other hand, every attribute access will incur an L2  cache miss -- unless, of course, tuples are 32 or fewer bytes long, in which case each L2 miss will pull two tuples from memory into the cache. If nulls are implemented using a bit array at the front of the tuple the situation may be worse as an additional memory access may be needed to access the bit array to check whether the attribute is null or not. This access will incur a L2 data cache miss. If the attribute is not null, a second miss will occur, except when the attribute and null bit array fall into the same 64-byte cache line.

To make this issue more concrete, assume a 32K-byte page size and 200-byte tuples. Each page will hold about 160 tuples. Scanning each page will incur at least 170 L2 cache misses (10 for scanning the slot array and 160 for scanning the tuples themselves) to process these 160 tuples. Modern processors do implement "prefetching," which detects consecutive memory accesses within a cache line and automatically initiates a load of the next cache line before it is referenced. However, this is only likely to help when accessing the slot array (which is walked sequentially), whereas following pointers from the slot array to tuples causes random access throughout the page that cannot be predicted by the prefetcher.


Column stores avoid the slotted page CPU penalty

Now consider what happens with a column store in which each column of a table is stored in a separate file. To simplify the discussion for now, assume no compression is used and that the attribute to which the predicate is being applied is a 4-byte integer. We assume that the column store processes updates, inserts, and deletes by writing them to a separate write- optimized store and periodically merging them into the main data store. This merging process rewrites both the main store as well as any associated indices. This means that attribute values in the main data store can be "dense packed" one after another and that tuple identifiers independent of page position are not needed in the main data store. Hence, column stores do not need a slot array in the main store.

Consider the process of scanning the values on each page to apply the selection predicate. With a 64-byte cache line and 4-byte integer attribute values, every 16th access to an attribute will incur a L2 data cache miss. Processing 160 attributes will incur just 10 misses compared to the 170 misses incurred by a row store -- assuming, again, no cache prefetching. Prefetching is likely to help the column store more than the row store, as the data values themselves are now being accessed sequentially rather than the random pattern of access generated by following pointers from the slotted page array.

The performance difference actually gets worse with bigger cache lines. For example, with 128-byte cache lines, the row store will incur 165 misses for every 160 tuples processed while the column store will incur only 5 misses.

As we mentioned in a previous posting, column stores are highly compressible. Assume that RLE compression is used on a column of integer attributes and that compressing the column yields a compression factor of 10. Each compressed column entry will consist of a (value, position, count) triple and will occupy 10 bytes. With a 64-byte cache line, each miss will bring 6 RLE triples into the cache. Since 160 tuples will, on the average, be compressed into 16 RLE triples, processing 160 tuples will incur only 3 cache misses. That is more than a 3X reduction compared to an uncompressed column store and a 30X reduction from an uncompressed row store!


PAX is an option for row stores, but has not been used

For the sake of "truth in advertising," there is an alternative way of laying out row store tuples on disk pages called PAX (Partition Attributes Across). Developed by Ailamaki and DeWitt, PAX partitions each data page into n smaller pages, one for each of the n attributes of the tuple (read the paper, Weaving Relations for Cache Performance from the 2001 VLDB Conference here). As each tuple is inserted into a page, each of its attributes gets stored in the corresponding minipage. The overall effect is a data page organized as n columns. While such a design has essentially the same cache behavior as a column store, it has the same I/O performance as a traditionally organized row store that uses slotted pages. As far as we are aware, no database vendor has yet adopted PAX for use in its products.


Comparing the impact on row and column stores

So what kind of performance difference does this mean when comparing row to column store architecture? Suppose for each memory access to an attribute, we spend 500 CPU cycles "computing" on it (this is consistent with estimates from recent database papers, such as the above-referenced PAX paper). If memory latency is 50 ns, then a L2 cache miss takes about 300 cycles on a 3 GHz processor. On a Core 2 Duo, accessing a record in L2 cache takes 14 cycles. In a non-row-oriented database, with 64-byte caches lines, 200-byte tuples with null headers, and 4-byte slot arrays, memory latencies will add about 660 cycles -- that's 2 + 1 in 16 L2 misses, plus 3 L2 accesses -- per record (1160 cycles in total). In contrast, in a column store, memory latencies will add only 45 cycles -- 1 in 16 L2 misses plus 2 L2 accesses -- per record  (545 total). Therefore, if a database is CPU limited -- which is commonly the case as many production databases have enough disks per CPU to ensure this -- a column store will have more than double the tuple throughput of a row store.

In summary, in addition to being more compressible and providing significant I/O savings compared to row stores, column stores are significantly more compatible with the design of modern CPUs and can at substantially improve CPU throughput for CPU-bound queries. The result is that as the gap between processor performance and memory latencies grows, column stores will continue to perform better compared to row stores. Additionally, we believe that column stores will also be better able to exploit future CPUs with dozens to hundreds of cores; however, we will reserve that discussion for a future post.


 

Categories

5 Comments

Anonymous said:

This blog should be renamed to "advocacy for column-stores," because, clearly, that's all this seems to be doing.

When this blog was started and was christened "The Database Column," it was kind of announcement that there would be regular posts on the database technology. But all there seems to be is advocacy for column-stores. Sad.

I had not heard of column-stores before this blog. I was glad to hear about them here and have been interested in them ever since. But seeing the amount of push you guys are giving to it here, I can but wonder if there is some big problem with column-stores that you are wontedly overlooking. Something so good does not need such heavy pushing.

Admin said:

Anonymous,

This blog is concerned with database technology and innovation -- the contributors, all experts in the database field (see their profiles here), happen to be big proponents of column stores for a variety of reasons that have been mentioned in existing posts and will be mentioned in future posts. They have been talking about or researching columns and discussing the value of the approach long before this blog was launched.

Obviously, column stores are not suitable for every task. Like many competing technology solutions, specific needs and environments usually favor one technology over another. Blind advocacy of columns is neither intended or delivered. For example, note that posts often have a qualifier. For example, in the recent CPU post, the author noted that "trends favor the use of a column-store architecture in database systems designed primarily for processing decision support queries." It did not say columns are the way to go in every case or even the majority of cases.

Advocating for a new, improved solution is a hallmark of the technology industry, whether it's client/server versus time share computing, DCOM versus CORBA versus Web services, or self-hosted versus software-as-a-service apps (SaaS). Proponents of the newer approach -- whether in the academic, vendor, or user world, or some combination of the three -- have strong feelings about the latest technology, even if it is not appropriate in all instances. That's what it's like with row versus columns. Would you suggest Web service or SaaS proponents not advocate for their favored approaches?

That said, we actually do have an upcoming post planned on when and where column stores don't make sense. When it sees the light of day depends on scheduling issues, but it might offer you some insights into the best use of column and row stores. We hope you continue to find value in the posts with the understanding that contributors do believe that column stores are a technology worth "heavy pushing."

Zbigniew Rozbicki said:

Does really a CPU fetch whole tuples not just attributes to L2 cache in a row store case?

Anonymous said:

Row-store with slots have slots so that index pointers dont have to change when the rows are moved around.

Questions:
1) Why do column-store columns not have to be moved around? What if we dont keep slots in the block of rows for row-store?
2) What if the attribute we were interested in, in the row-store case, was 64 bytes, aligned to the cache line size boundary with fixed length rows and we were not using slots in the block?
3) In the use-case of a column scan (SUM(emp.salary)), clearly packing all the column values together is the way to go for the performance reasons cited. What happens when there is an index on the column and there is frequent update activity on variable length column for the column-store use case?

Sameer W said:

Column based storage is a very new topic for me and this blog is my primary source of introduction.
I have read some papers on Column Based Storage and I have some questions to which I was hoping to get more clarification here.

1. Do column based storages require storing the same column multiple times based on sort attribute? Or is a column stored just once in its natural order to make storage more compact.

2. Does column based storage restrict the choices for selecting a column for sorting (based on how it is stored) or atleast makes sorting on certain columns slow from a performance standpoint?

Thanks

Leave a comment

About this Post

This page contains a single post by David DeWitt published on October 22, 2007 1:25 PM.

Thoughts on the VLDB conference was the previous entry in this blog.

Database parallelism choices greatly impact scalability is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.