<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>The Database Column</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/" />
    <link rel="self" type="application/atom+xml" href="http://www.databasecolumn.com/atom.xml" />
    <id>tag:www.databasecolumn.com,2007-08-30://1</id>
    <updated>2008-03-20T15:14:54Z</updated>
    <subtitle>A multi-author blog on database technology and innovation.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Publishing Platform 4.0</generator>

<entry>
    <title>Supporting Column Store Performance Claims</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2008/03/supporting-column-store-perfor.html" />
    <id>tag:www.databasecolumn.com,2008://1.35</id>

    <published>2008-03-14T14:58:21Z</published>
    <updated>2008-03-20T15:14:54Z</updated>

    <summary>In this post, Mike Stonebraker tackles two issues with regard to row- versus column-store databases. In the first issue, he looks at performance challenges given the demands of users. In the second issue, he discusses the availability of third-party connectivity as well as automatic database design tools.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="performance" label="performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="rowstores" label="row stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[We commonly encounter questions related to column store performance from those considering moving away from their current DBMS solution. In this entry, I want to share my thoughts on this topic.<br /><br /><br /><b>Issue No. 1: Addressing the Performance Claim</b><br /><br />There is a well-known adage: "If it's not broken, don't fix it." Any client who is satisfied with his current data warehouse solution would be ill-advised to change it. However, Vertica sees enormous pain in the data warehouse market, due to combinations of the following factors:<br /><br /><ol><li><b>Increasing query complexity.</b> The size of data warehouses is going up faster than disks are getting cheaper. An increasing number of people are being trained and equipped to analyze information, and they desire to correlate more and more data. Since query complexity goes up more than linearly with warehouse size, this means that warehouse problems are getting harder over time - not easier.<br /><br />Many warehouse DBAs can predict with some precision when they will "hit the wall" with their current solution. The result of hitting the wall is an expensive guided tour through the enterprise wallet for more hardware, different software, or both.<br /><br /></li><li><b>The desire for real-time warehouses.</b> Most warehouses are loaded periodically and are out of date by ½ of the length of this periodicity. But many enterprises want more timely business intelligence. The obvious solution is to "trickle load" data in parallel with user queries. However, this is impossible in many current warehouse products.<br /><br /></li><li><b>The desire for timely answers.</b> In many current products, an ad-hoc query requires one to go out to lunch before the answer is returned. Sometimes response time is even worse than this. 
The result of delayed answers is lost human productivity and a move to "batch thinking" rather than "interactive thinking."</li></ol><br />If the user is in serious pain with his current warehouse solution, then the obvious answer is "find a better one."<br /><br />In summary, performance is either black or white. Either it is good enough or it isn't. And if performance is important, column databases have demonstrated orders of magnitude better performance (50x in round numbers) than row stores in customer benchmarks and TPC-H benchmarks. Industry experts, such as Gartner, have validated these results (click on <a href="http://www.vertica.com/elqNow/elqRedir.htm?ref=http://www.vertica.com/product/resourcelibrary/stonebrakergartner">this link</a> to launch a Vertica-Gartner podcast on this topic).<br /><br />We frequently see column databases outperform row stores by large margins in customer benchmark settings. Here are some results a customer measured very recently:<br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="benchmark_table.jpg" src="http://www.databasecolumn.com/images/2008/benchmark_table.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="245" width="575" /></span>And remember -- it doesn't have to be an either-or decision. As Don Feinberg of Gartner suggests in his podcast, using a column database in conjunction with an enterprise data warehouse (EDW) can provide users with better analytic performance and also offload certain analyses from the EDW in order to improve its performance without costly upgrades or re-designs.<br /><br /><b><br />Issue No. 2: Of Connectivity and Automatic Design Tools</b><br /><br />My second point concerns the perceived connectivity advantages of row stores. Vertica and other column-oriented databases use ODBC/JDBC interfaces. 
As such, they get connectivity to all of the third-party tools that row stores utilize. Hence, "connectivity" is a wash between row stores and column stores. Both kinds of products connect to most -- if not all -- of the popular tools. &nbsp;<br /><br />Lastly, there is a perception that column databases introduce additional complexity for DBAs. This is untrue. Vertica includes an automatic physical database designer that helps a DBA set all of the performance options in Vertica. Hence, there is no "complexity" factor; manual optimization by a human is a thing of the past. DB2 has a similar tool. The real question is, "How good is the automatic tool from any given vendor?" We are confident in Vertica's ability to automatically generate a good physical design; it would be interesting to conduct a comparative "out-of-the-box" performance benchmark that measured automatic tool effectiveness.<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>In response to Monash&apos;s post on the four categories of RDBMS</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2008/02/responding-to-monash-2.html" />
    <id>tag:www.databasecolumn.com,2008://1.33</id>

    <published>2008-02-18T18:52:11Z</published>
    <updated>2008-03-14T15:36:09Z</updated>

    <summary>In this response to a Curt Monash post over at the DBMS2 blog, Mike Stonebraker offers his reactions. He sees two categories of relational analytic/data warehouse databases, row stores and column stores, and notes that they have very different characteristics and should not be lumped together. He also points out that if high performance is required, current high-end relational engines can be beaten by a factor of 80 or so on TPC-C.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="datawarehouse" label="data warehouse" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="databaseperformance" label="database performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="oltp" label="OLTP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[As I did last week, I am using this post to respond to an article published by Curt Monash. You can read his full post, titled "Database management system choices - 4 categories of relational," <a href="http://www.dbms2.com/2008/02/15/relational-database-management-categories/">here</a>. In this post, I will discuss the issue I have with Curt's category characterization of DBMS systems.<br /><br />First, I see two categories of relational analytic/data warehouse databases, row stores and column stores. They have very different characteristics. I would not lump them together, as this post does. Moreover, I expect the overwhelming majority of analytic data management workloads to move to column stores over time as these products become more mature because of the overwhelming performance advantage they offer on most analytic workloads.<br /><br />I don't know what competitive challenge to current high-end OLTP vendors Curt has in mind; however, I will offer my own. If performance is not a big issue, then current open-source relational DBMSs work quite well. As a result, I expect the "low end" to go to open source systems.<br /><br />On the other hand, if high performance is required, then I have shown in a recent paper (<a href="http://www.vldb2007.org/">2007 VLDB proceedings</a>) that current high-end relational engines can be beaten by a factor of 80 or so on TPC-C. This new collection of ideas may be leveragable into ultra-fast future commercial products that will challenge the current vendors at the high end. I think it is likely that the current vendors will be "caught in the middle."<br /><br />Lastly, most customers that I talk to are upset with the "out-of-box" experience of the current offerings from the high-end vendors. The products are hard to install, hard to tune, hard to learn, and just generally hard to use. 
If the products don't get much easier to use, then data administration costs will go to 100% sooner or later -- relegating these products to niche markets.<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>Responding to Monash&apos;s recent post on diversity of database systems</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2008/02/responding-to-monash-1.html" />
    <id>tag:www.databasecolumn.com,2008://1.31</id>

    <published>2008-02-16T18:11:22Z</published>
    <updated>2008-02-16T18:39:10Z</updated>

    <summary>In this post, Mike Stonebraker comments on a post over at DBMS2 titled &quot;Database management system choices - overview.&quot; Mike makes two points. First, he offers his list of the different types of DBMSs that he sees as viable. Second, he discusses OLTP and the shared nothing architecture.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="datawarehouse" label="data warehouse" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[This week, Curt Monash published a post titled "Database management system choices - overview" on the <a href="http://www.dbms2.com/">DBMS2 blog</a> that makes the argument that in the database world, one size does not fit all. In response, I have one comment and one quibble (read the post in its entirety <a href="http://www.dbms2.com/2008/02/15/database-management-system-choices-overview/">here</a>).<br /><br /><br /><b>The comment: Many different kinds of DBMSs</b><br /><br />Curt's post leads to the obvious question: Just how many different kinds of viable DBMSs can we expect to see? I can imagine the following:<br /><br /><ol><li><b>OLTP DBMSs</b> focused on fast, reliable transaction processing<br /><b><br /></b></li><li><b>Analytic/Data Warehouse DBMSs</b> focused on efficient load and ad-hoc query performance<br /><br /></li><li><b>Science DBMSs</b> -- after all MatLab does not scale to disk-sized arrays<br /><br /></li><li><b>RDF stores</b> focused on efficiently storing semi-structured data in this format<br /><br /></li><li><b>XML stores</b> focused on semi-structured data in this format<br /><br /></li><li><b>Search engines</b> -- the big players all use proprietary engines in this area<br /><br /></li><li><b>Stream Processing Engines</b> focused on real-time StreamSQL<br /><br /></li><li><b>"Lean and Mean," less-than-a-database engines</b> focused on doing a small number of things very well (embedded databases are probably in this category)<br /><br /></li><li><b>MapReduce and Hadoop</b> -- after all Google has enough "throw weight" to define a category</li></ol><br /><br />I expect all of these to be architected differently, with the possible exception of RDF stores, which are efficiently supported on top of column stores and focused on the warehouse market.<br /><br /><br /><b>The quibble: OLTP demands shared nothing</b><br /><br />Every high-end OLTP application is currently requiring 7 x 24 x 365 x 10 years of availability. 
That is, the database has only one state, which is "up."&nbsp; Hence, high availability -- in the face of crashes as well as disasters -- is a requirement. Disaster recovery requires replication over a wide area network; recovery from crashes requires LAN-based replication. As such, every OLTP system is, in fact, deployed over a shared-nothing architecture, encompassing both LAN and WAN networking.<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>INSERT performance in column stores</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2008/02/insert-performance.html" />
    <id>tag:www.databasecolumn.com,2008://1.30</id>

    <published>2008-02-06T14:50:51Z</published>
    <updated>2008-03-03T16:13:43Z</updated>

    <summary>In this post, Stan Zdonik examines the issue of INSERT performance in column stores. He notes that, by implementing certain strategies, it is possible to have a column store with INSERT performance that is at least competitive with that of the major row stores.</summary>
    <author>
        <name>Stan Zdonik</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="insert" label="INSERT" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="zdonik" label="Zdonik" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[The most common question I am asked about column stores is, "Isn't INSERT performance poor?" This question stems from the fact that in a column store a new tuple must be split into its component column values, and each such value must then be written to a different place (file). This would seemingly result in writing a large number of different disk blocks for every insertion. Furthermore, if the physical representation of the column is sorted and compressed, preserving that representation will only add to the overhead of an INSERT. While there is some truth to this line of reasoning, the problem can be overcome with the proper implementation.<br /><br /><b><br />Overcoming the INSERT performance penalty</b><br /><br />One approach to significantly mitigating the performance problem is to batch INSERTs and to perform the sorts, compression, and disk writes in large groups. By doing this, existing data is kept on disk in its sorted and compressed form and new tuples are batched in a separate memory space or cache. This cache is maintained as columns in insertion order. Periodically, an asynchronous process runs and writes a batch of tuples, merging them into the disk-based storage system. In this way, the performance cost of sorting and compression is shared and amortized across many tuples. The expected number of disk writes per INSERT will also decrease as the batch size grows. Further, if the insertion-order cache is stored in main memory, this structure can be quickly scanned.<br /><br />A fair question to ask is whether this approach would mean that the answers to queries would be stale. The answer is no ... if the query evaluator looks in both places (the disk and the cache). 
To do so would require the query optimizer to generate two plans, since the data is structured differently in each location, but the extra query planning work is worth the trouble because of the boost in INSERT performance.<br /><br />It should also be pointed out that column stores can partition tables across a collection of shared-nothing nodes in a cluster. If INSERTs are randomly distributed on the partitioning key, then the load introduced by high INSERT rates is distributed evenly across the cluster. If the INSERT rate grows, more nodes can be added to cope with the increase.<br /><br /><br /><b>A note about ACID</b><br /><br />Of course, in order to support <a href="http://en.wikipedia.org/wiki/ACID">ACID</a> transactions in this setting, there must be a safe way to allow committed data to reside in main memory. This can be accomplished by keeping redundant copies in multiple distributed main memories. In general, one can achieve k-safety, where k is the number of nodes that can fail without losing any work, by keeping data copies on k+1 different machines. All INSERTs will be sent to all k+1 relevant sites and stored in their main memory caches. Once all these copies are installed, the tuple is stable (subject to the k-safety constraints).<br /><br /><br /><b>INSERT Performance Benchmarks</b><br /><br />By implementing all these strategies, it is possible to have a column store with INSERT performance that is at least competitive with that of the major row stores. In fact, benchmarks have shown that load performance for a column store is often better than that of a row store.<br /><br /> ]]>
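<![CDATA[<br />The batching strategy described under "Overcoming the INSERT performance penalty" can be sketched in a few lines of Python. This is only an illustrative toy, not any vendor's implementation: the class name <code>BatchedColumn</code>, its methods, and the <code>batch_size</code> parameter are assumptions made for this example, a plain list stands in for the sorted on-disk column, compression is omitted, and the merge runs synchronously when the cache fills rather than in an asynchronous background process.<br /><br /><pre>
```python
import heapq


class BatchedColumn:
    """Toy single-column store: a sorted 'disk' list plus an insertion-order cache.

    All names here are hypothetical and exist only for illustration.
    """

    def __init__(self, batch_size=4):
        self.disk = []        # stands in for the sorted on-disk column
        self.cache = []       # in-memory cache, kept in insertion order
        self.batch_size = batch_size
        self.merges = 0       # counts simulated bulk writes to disk

    def insert(self, value):
        # An INSERT only appends to the cache; no sort or disk write yet.
        self.cache.append(value)
        if len(self.cache) >= self.batch_size:
            self.merge()

    def merge(self):
        # Sort the whole batch once, then merge it into the sorted data,
        # so the sorting cost is amortized across many tuples.
        batch = sorted(self.cache)
        self.disk = list(heapq.merge(self.disk, batch))
        self.cache = []
        self.merges += 1

    def query_at_least(self, low):
        # Look in both places (disk and cache) so answers are never stale.
        from_disk = [v for v in self.disk if v >= low]
        from_cache = [v for v in self.cache if v >= low]
        return sorted(from_disk + from_cache)
```
</pre>With <code>batch_size=3</code>, inserting 5, 1, 9 and then 7 triggers a single merge, leaving [1, 5, 9] in the sorted data and 7 in the cache; a query for values of at least 6 consults both structures and returns [7, 9].<br /><br />]]>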
        
    </content>
</entry>

<entry>
    <title>MapReduce II</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2008/01/mapreduce-continued.html" />
    <id>tag:www.databasecolumn.com,2008://1.29</id>

    <published>2008-01-25T19:56:04Z</published>
    <updated>2008-02-18T19:38:54Z</updated>

    <summary>In this follow-up post, David DeWitt and Michael Stonebraker discuss the feedback from their previous post on MapReduce. They focus on four criticisms of their first article: 1) that MapReduce is not a database system and should not be judged as one; 2) that MapReduce has excellent scalability, demonstrated by Google&apos;s use; 3) that MapReduce is cheap compared to high-end DBMS solutions; and 4) that their stance was the result of DBMS &quot;gray beards&quot; trying to defend their turf/legacy from the MapReduce &quot;young turks.&quot;</summary>
    <author>
        <name>David DeWitt</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dewitt" label="DeWitt" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="mapreduce" label="MapReduce" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[<i>[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]</i><br /><br /><a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html">Last week's MapReduce post</a> attracted tens of thousands of readers and generated many comments, almost all of them attacking our critique. Just to let you know, we don't hold a personal grudge against MapReduce. MapReduce didn't kill our dog, steal our car, or try and date our daughters. <br /><br />Our motivations for writing about MapReduce stem from MapReduce being increasingly seen as the most advanced and/or only way to analyze massive datasets. Advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real world use. <br /><br />The point of our initial post was to say that there are striking similarities between MapReduce and a fairly primitive parallel database system. As such, MapReduce can be significantly improved by learning from the parallel database community.<br /><br />So, hold off on your comments for just a few minutes, as we will spend the rest of this post addressing four specific topics brought up repeatedly by those who commented on our previous blog: <br /><br /><ol><li>MapReduce is not a database system, so don't judge it as one<br /><br /></li><li>MapReduce has excellent scalability; the proof is Google's use<br /><br /></li><li>MapReduce is cheap and databases are expensive<br /><br /></li><li>We are the old guard trying to defend our turf/legacy from the young turks</li></ol><br /><br /><b>Feedback No. 1: MapReduce is not a database system, so don't judge it as one</b><br /><br />It's not that we don't understand this viewpoint. We are not claiming that MapReduce is a database system. What we are saying is that like a DBMS + SQL + analysis tools, MapReduce can be and is being used to analyze and perform computations on massive datasets. 
So we aren't judging apples and oranges. We are judging two approaches to analyzing massive amounts of information, even for less structured information. <br /><br />To illustrate our point, assume that you have two very large files of facts. The first file contains structured records of the form: <br /><br /><blockquote>Rankings (pageURL, pageRank)<br /></blockquote><br />Records in the second file have the form: <br /><br /><blockquote>UserVisits (sourceIPAddr, destinationURL, date, adRevenue)<br /></blockquote><br />Someone might ask, "What IP address generated the most ad revenue during the week of January 15th to the 22nd, and what was the average page rank of the pages visited?"<br /><br />This question is a little tricky to answer in MapReduce because it consumes two datasets rather than one, and it requires a "join" of the two datasets to find pairs of Rankings and UserVisits records that have matching values for pageURL and destinationURL. In fact, it appears to require three MapReduce phases, as noted below.<br /><br /><blockquote><b>Phase 1</b><br /><br />This phase filters out UserVisits records that are outside the desired date range and then "joins" the qualifying records with records from the Rankings file. <br /><br /><ul><li><b>Map program:</b> The map program scans through UserVisits and Rankings records. Each UserVisits record is filtered on the date range specification. Qualifying records are emitted with composite keys of the form &lt;destinationURL, T1 &gt; where T1 indicates that it is a UserVisits record. Rankings records are emitted with composite keys of the form &lt;pageURL, T2 &gt;&nbsp; (T2 is a tag indicating it is a Rankings record). Output records are repartitioned using a user-supplied partitioning function that only hashes on the URL portion of the composite key.<br /><br /></li><li><b>Reduce Program: </b>The input to the reduce program is a single sorted run of records in URL order. 
For each unique URL, the program splits the incoming records into two sets (one for Rankings records and one for UserVisits records) using the tag component of the composite key. To complete the join, reduce finds all matching pairs of records of the two sets. Output records are in the form of Temp1 (sourceIPAddr, pageURL, pageRank, adRevenue).&nbsp; </li></ul><br />The reduce program must be capable of handling the case in which one or both of these sets with the same URL are too large to fit into memory and must be materialized on disk. Since access to these sets is through an iterator, a straightforward implementation will result in what is termed a nested-loops join. This join algorithm is known to have very bad I/O performance characteristics, as the "inner" set is scanned once for each record of the "outer" set.<br /><br /><br /><b>Phase 2</b><br /><br />This phase computes the total ad revenue and average page rank for each source IP address.<br /><br /><ul><li><b>Map program:</b> Scan Temp1 using the identity function on sourceIPAddr.<br /><br /></li><li><b>Reduce program:</b> The reduce program makes a linear pass over the data. For each sourceIPAddr, it sums the ad revenue and computes the average page rank, retaining the sourceIPAddr with the maximum total ad revenue. Each reduce worker then outputs a single record of the form Temp2 (sourceIPAddr,&nbsp; total_adRevenue, average_pageRank).</li></ul><br /><b>Phase 3</b><br /><br /><ul><li><b>Map program:</b> The program uses a single map worker that scans Temp2 and outputs the record with the maximum value for total_adRevenue. </li></ul></blockquote><br />We realize that portions of the processing steps described above are handled automatically by the MapReduce infrastructure (e.g., sorting and partitioning the records). 
Although we have not written this program, we estimate that the custom parts of the code (i.e., the map() and reduce() functions) would require substantially more code than the two fairly simple SQL statements needed to do the same:<br /><br /><blockquote><b>Q1</b><br /><br />Select sourceIPAddr, avg(pageRank) as avgPR, sum(adRevenue) as adTotal Into Temp<br />From Rankings, UserVisits<br />Where Rankings.pageURL = UserVisits.destinationURL and<br />date &gt; "Jan 14" and date &lt; "Jan 23"<br />Group by sourceIPAddr<br /><br /><br /><b>Q2</b><br /><br />Select sourceIPAddr, adTotal, avgPR<br />From Temp<br />Where adTotal = (Select max(adTotal) From Temp)<br /></blockquote><br />No matter what you think of SQL, eight lines of code are almost certainly easier to write and debug than the programming required for MapReduce. We believe that MapReduce advocates should consider the advantages that layering a high-level language like SQL could provide to users of MapReduce. Apparently we're not alone in this assessment, as efforts such as PigLatin and Sawzall appear to be promising steps in this direction. <br /><br />We also firmly believe that augmenting the input files with a schema would provide the basis for improving the overall performance of MapReduce applications by allowing B-trees to be created on the input data sets and techniques like hash partitioning to be applied. These are technologies in widespread practice in today's parallel DBMSs, of which there are quite a number on the market, including ones from IBM, Teradata, Netezza, Greenplum, Oracle, and Vertica. 
All of these should be able to execute this program with the same or better scalability and performance as MapReduce.<br /><br />Here's how these capabilities could benefit MapReduce:<br /><br /><blockquote><ol><li><b>Indexing.</b> The filter condition (date &gt; "Jan 14" and date &lt; "Jan 23") can be executed by using a B-tree index on the date attribute of the UserVisits table, avoiding a sequential scan of the entire table.<br /><br /> </li><li><b>Data movement.</b> When you load files into a distributed file system prior to running MapReduce, data items are typically assigned to blocks/partitions in sequential order. As records are loaded into a table in a parallel database system, it is standard practice to apply a hash function to an attribute value to determine which node the record should be stored on (the same basic idea as is used to determine which reduce worker should get an output record from a map instance). For example, records being loaded into the Rankings and UserVisits tables might be mapped to a node by hashing on the pageURL and destinationURL attributes, respectively. If loaded this way, the join of Rankings and UserVisits in Q1 above would be performed completely locally <i>with absolutely no data movement between nodes</i>. Furthermore, as result records from the join are materialized, they will be pipelined directly into a local aggregate computation without being written first to disk. This local aggregate operator will partially compute the two aggregates (sum and average) concurrently (what is called a combiner in MapReduce terminology). These partial aggregates are then repartitioned by hashing on sourceIPAddr to produce the final results for Q1.<br /><br />It is certainly the case that you could do the same thing in MapReduce by using hashing to map records to chunks of the file and then modifying the MapReduce program to exploit the knowledge of how the data was loaded. 
But in a database, physical data independence happens automatically. When Q1 is "compiled," the query optimizer will extract partitioning information about the two tables from the schema.&nbsp; It will then generate the correct query plan based on this partitioning information (e.g., maybe Rankings is hash partitioned on pageURL but UserVisits is hash partitioned on sourceIPAddr). This happens transparently to any user (modulo changes in response time) who submits a query involving a join of the two tables. <br /><br /> </li><li><b>Column representation.</b> Many questions access only a subset of the fields of the input files. A column store does not need to read the others.<br /><br /> </li><li><b>Push, not pull.</b> MapReduce relies on the materialization of the output files from the map phase on disk for fault tolerance. Parallel database systems push the intermediate files directly to the receiving (i.e., reduce) nodes, avoiding writing the intermediate results and then reading them back as they are pulled by the reduce computation. Materialization gives MapReduce far superior fault tolerance, at the expense of additional I/Os.&nbsp; </li></ol></blockquote><br />In general, we expect these mechanisms to provide about a factor of 10 to 100 performance advantage, depending on the selectivity of the query, the width of the input records to the map computation, and the size of the output files from the map phase. As such, we believe that 10 to 100 parallel database nodes can do the work of 1,000 MapReduce nodes. <br /><br />To further illustrate our point, suppose you have a more general filter, F, a more general group_by function, G, and a more general Reduce function, R. PostgreSQL (an open source, free DBMS) allows the following SQL query over a table T:<br /><br /><blockquote>Select R (T)<br />From T<br />Where F (T)<br />Group by G (T)<br /></blockquote><br />F, R, and G can be written in a general-purpose language like C or C++. 
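<br /><br />The shape of that query can be tried out in a minimal sketch. The example below is an assumption-laden illustration rather than the PostgreSQL mechanism itself: it uses Python's built-in <code>sqlite3</code> module as a stand-in SQL engine, because it can register user-defined scalar functions and aggregates in a few lines. The bodies of F, G, and R, the helper <code>run_query</code>, and the two-column table layout are all hypothetical choices made for the example.<br /><br /><pre>
```python
import sqlite3


def f_filter(value):
    """F: a user-defined filter; keeps only non-negative values."""
    return 1 if value >= 0 else 0


def g_group(key):
    """G: a user-defined grouping function; groups by the key's first letter."""
    return key[0]


class RSum:
    """R: a user-defined aggregate that sums the values in each group."""

    def __init__(self):
        self.total = 0

    def step(self, value):
        self.total += value

    def finalize(self):
        return self.total


def run_query(rows):
    """Load (key, value) rows into a table T and run the F/G/R query."""
    conn = sqlite3.connect(":memory:")
    # Register the user-defined functions and the aggregate with the engine.
    conn.create_function("F", 1, f_filter)
    conn.create_function("G", 1, g_group)
    conn.create_aggregate("R", 1, RSum)
    conn.execute("CREATE TABLE T (k TEXT, v INTEGER)")
    conn.executemany("INSERT INTO T VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT G(k), R(v) FROM T WHERE F(v) GROUP BY G(k) ORDER BY G(k)"
    ).fetchall()
    conn.close()
    return result
```
</pre>In this sketch F plays the role of the map-side filter, G decides which group (reduce key) a row belongs to, and R is the per-group reduce step, while the engine handles the grouping itself.<br /><br />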
A SQL engine, extended with user-defined functions and aggregates, has nearly all -- if not all -- of the generality of MapReduce.&nbsp;&nbsp; <br /><br />As such, we claim that <i>most things that are possible in MapReduce are also possible in a SQL engine</i>. Hence, it is entirely appropriate to compare the two approaches. We are working on a more complete paper that demonstrates the relative performance and relative programming effort of the two approaches, so stay tuned.&nbsp;&nbsp; <br /><br /><b><br />Feedback No. 2: MapReduce has excellent scalability; the proof is Google's use</b><br /><br />Many readers took offense at our comment about scaling and asserted that since Google runs MapReduce programs on 1,000s (perhaps 10s of 1,000s) of nodes it must scale well. Having started benchmarking database systems 25 years ago (yes, in 1983), we believe in a more scientific approach toward evaluating the scalability of any system for data intensive applications.<br /><br />Consider the following scenario. Assume that you have a 1 TB data set that has been partitioned across 100 nodes of a cluster (each node will have about 10 GB of data). Further assume that some MapReduce computation runs in 5 minutes if 100 nodes are used for both the map and reduce phases. Now scale the dataset to 10 TB, partition it over 1,000 nodes, and run the same MapReduce computation using those 1,000 nodes. If the performance of MapReduce scales linearly, it will execute the same computation on 10x the amount of data using 10x more hardware in the same 5 minutes. <i>Linear scaleup is the gold standard for measuring the scalability of data intensive applications</i>. As far as we are aware there are no published papers that study the scalability of MapReduce in a controlled scientific fashion. MapReduce may indeed scale linearly, but we have not seen published evidence of this.&nbsp;&nbsp;&nbsp; <br /><br /><b><br />Feedback No. 
3: MapReduce is cheap and databases are expensive</b><br /><br />Every organization has a "build" versus "buy" decision, and we don't question the decision by Google to roll its own data analysis solution. We also don't intend to defend DBMS pricing by the commercial vendors. What we wanted to point out is that we believe it is possible to build a version of MapReduce with more functionality and better performance. Pig is an excellent step in this direction. <br /><br />Also, we want to mention that there are several open source (i.e., free) DBMSs, including PostgreSQL, MySQL, Ingres, and BerkeleyDB. Several of the aforementioned parallel DBMS companies have increased the scale of these open source systems by adding parallel computing extensions.<br /><br />A number of individuals also commented that SQL and the relational data model are too restrictive. Indeed, the relational data model might very well be the wrong data model for the types of datasets that MapReduce applications are targeting. However, there is considerable ground between the relational data model and no data model at all. The point we were trying to make is that developers writing business applications have benefited significantly from the notion of organizing data in the database according to a data model and accessing that data through a declarative query language. We don't care what that language or model is. Pig, for example, employs a nested relational model, which gives developers flexibility that a traditional 1NF model doesn't allow.<br /><br /><br /><b>Feedback No. 4: We are the old guard trying to defend our turf/legacy from the young turks</b><br /><br />Since both of us are among the "gray beards" and have been on this earth about 2 Giga-seconds, we have seen a lot of ideas come and go. 
We are constantly struck by the following two observations:<br /><br /><ul><li><b>How insular computer science is.</b> The propagation of ideas from sub-discipline to sub-discipline is very slow and sketchy. Most of us are content to do our own thing, rather than learn what other sub-disciplines have to offer.<br /><br /></li><li><b>How little knowledge is passed from generation to generation.</b> In a recent paper entitled "What goes around comes around," (M. Stonebraker/J. Hellerstein, Readings in Database Systems 4th edition, MIT Press, 2004) one of us noted that many current database ideas were tried a quarter of a century ago and discarded. However, such pragmatic knowledge does not seem to be passed down from the "gray beards" to the "young turks."&nbsp; The turks and gray beards aren't usually, and shouldn't be, adversaries. </li></ul><br />Thanks for stopping by the "pasture" and reading this post. We look forward to reading your feedback, comments and alternative viewpoints.<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>MapReduce: A major step backwards</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html" />
    <id>tag:www.databasecolumn.com,2008://1.28</id>

    <published>2008-01-17T21:20:43Z</published>
    <updated>2008-02-18T03:36:25Z</updated>

    <summary>In this post, David DeWitt and Michael Stonebraker discuss MapReduce. While it may be a good idea for writing certain types of general-purpose computations, they believe it is a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel, as it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; and incompatible with all of the tools DBMS users have come to depend on.</summary>
    <author>
        <name>David DeWitt</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database history" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="databaseperformance" label="database performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dewitt" label="DeWitt" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="mapreduce" label="MapReduce" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[<i>[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]</i><br /><br />On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a>. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.<br /><br />For example, IBM and Google have announced plans to make a 1,000-processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshmen how to program using the MapReduce framework.<br /><br />As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. 
MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:<br /><br /><ol><li>A giant step backward in the programming paradigm for large-scale data-intensive applications<br /><br /></li><li>A sub-optimal implementation, in that it uses brute force instead of indexing<br /><br /></li><li>Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago<br /><br /></li><li>Missing most of the features that are routinely included in current DBMSs<br /><br /></li><li>Incompatible with all of the tools DBMS users have come to depend on<br /></li></ol><br />First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.<br /><br /><br /><b>What is MapReduce?</b><br /><br />The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called <i>map</i> and <i>reduce</i> plus a framework for executing a possibly large number of instances of each program on a compute cluster.&nbsp;&nbsp; <br /><br />The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into <i>M</i> disjoint buckets by applying a function to the key of each output record.&nbsp;&nbsp; This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with <i>M</i> output files, one for each bucket.<br /><br />In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. 
If <i>N</i> nodes participate in the map phase, then there are <i>M</i> files on disk storage at each of <i>N</i> nodes, for a total of <i>N</i> * <i>M</i> files; <i>F<sub>i,j</sub></i>,&nbsp; 1 ≤ <i>i</i> ≤ <i>N</i>,&nbsp; 1 ≤ <i>j</i> ≤ <i>M</i>.<br /><br />The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.&nbsp; <br /><br />The second phase of a MapReduce job executes <i>M</i> instances of the reduce program, <i>R<sub>j</sub></i>, 1 ≤ <i>j</i> ≤ <i>M</i>.&nbsp; The input for each reduce instance <i>R<sub>j</sub></i> consists of the files <i>F<sub>i,j</sub></i>,&nbsp; 1 ≤ <i>i</i> ≤ <i>N</i>.&nbsp; Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.<br /><br />To draw an analogy to SQL, map is like the <i>group-by</i> clause of an aggregate query. Reduce is analogous to the <i>aggregate</i> function (e.g., average) that is computed over all the rows with the same group-by attribute.<br /><br />We now turn to the five concerns we have with this computing paradigm.<br /><br /><br /><b>1. MapReduce is a step backwards in database access</b><br /><br />As a data processing paradigm, MapReduce represents a giant step backwards. 
The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.<br /><br /><ul><li>Schemas are good.<br /><br /></li><li>Separation of the schema from the application is good.<br /><br /></li><li>High-level access languages are good.<br /></li></ul><br />MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern DBMSs were invented.<br /><br />The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.<br /><br />It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.<br /><br />During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. 
One of the key issues was whether a DBMS access program should be written:<br /><br /><ul><li>By stating what you want - rather than presenting an algorithm for how to get it (relational view)<br /><br /></li><li>By presenting an algorithm for data access (Codasyl view)<br /></li></ul><br />The result is now ancient history, but the entire world saw the value of high-level languages, and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.<br /><br />MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.&nbsp;&nbsp; <br /><br />Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}), different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.<br /><br /><b><br />2. MapReduce is a poor implementation</b><br /><br />All modern DBMSs use hash or B-tree indexes to accelerate access to data. 
If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.<br /><br />MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.<br /><br />One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3],&nbsp; Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.&nbsp; <br /><br />To summarize this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.&nbsp; <br /><br />There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.<br /><br />One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database Systems: The Future of High Performance Database Systems," [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. 
The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.<br /><br />There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the <i>N</i> map instances produces <i>M</i> output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If <i>N</i> is 1,000 and <i>M</i> is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously -- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.<br /><br />Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.<br /><br /><br /><b>3. MapReduce is not novel</b><br /><br />The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. 
The idea of partitioning a large data set into smaller partitions was first proposed <span>in "Application of Hash to Data Base Machine and Its Architecture" [11]</span> as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7], Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash-based splitting. DeWitt [2] showed how these techniques could be adapted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.&nbsp;&nbsp; <br /><br />Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.&nbsp;&nbsp; <br /><br />While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid-1980s. 
Essentially, all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.&nbsp; <br /><br /><br /><b>4.&nbsp; MapReduce is missing features</b><br /><br />All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:<br /><br /><ul><li><b>Bulk loader</b> -- to transform input data in files into a desired format and load it into a DBMS<br /><br /></li><li><b>Indexing</b> -- as noted above<br /><br /></li><li><b>Updates</b> -- to change the data in the data base<br /><br /></li><li><b>Transactions</b> -- to support parallel update and recovery from failures during update<br /><br /></li><li><b>Integrity constraints</b> -- to help keep garbage out of the data base<br /><br /></li><li><b>Referential integrity</b> -- again, to help keep garbage out of the data base<br /><br /></li><li><b>Views</b> -- so the schema can change without having to rewrite the application program<br /></li></ul><br />In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.<br /><br /><b><br />5.&nbsp; MapReduce is incompatible with the DBMS tools</b> <br /><br />A modern SQL DBMS has available all of the following classes of tools:<br /><br /><ul><li><b>Report writers</b> (e.g., Crystal Reports) to prepare reports for human visualization<br /><br /></li><li><b>Business intelligence tools</b> (e.g., Business Objects or Cognos) to enable ad-hoc querying of large data warehouses<br /><br /></li><li><b>Data mining tools</b> (e.g., Oracle Data Mining or IBM DB2 Intelligent Miner) to allow a user to discover structure in large data sets<br /><br /></li><li><b>Replication tools</b> (e.g., Golden Gate) to allow a user to replicate data from one DBMS to another<br /><br /></li><li><b>Database design tools</b> (e.g., Embarcadero) to assist the user in constructing a data base.<br /></li></ul><br />MapReduce cannot use these tools and has none of its own. 
Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.<br /><br /><br /><b>In Summary</b><br /><br />It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology -- in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.<br /><br />We fully understand that database systems are not without their problems. The database community recognizes that database systems are too "hard" to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally, we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig [10] project at Yahoo! Research is one such effort.<br /><br /><br /><b><br />References</b> <br /><br />[1] "MapReduce:&nbsp; Simplified Data Processing on Large Clusters," Jeff Dean and Sanjay Ghemawat, Proceedings of the 2004 OSDI Conference, 2004.<br /><br />[2] "The Gamma Database Machine Project," DeWitt, et al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.<br /><br />[3] "Gamma - A High Performance Dataflow Database Machine,"&nbsp; DeWitt, D, R. Gerber, G. 
Graefe,&nbsp; M. Heytens, K. Kumar, and M. Muralikrishna,&nbsp; Proceedings of the 1986 VLDB Conference,&nbsp; 1986.<br /><br />[4] "Prototyping Bubba, A Highly Parallel Database System," Boral, et al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.<br /><br />[6] "Parallel Database Systems: The Future of High Performance Database Systems," David J. DeWitt and Jim Gray,&nbsp; CACM,&nbsp; Vol. 35, No. 6,&nbsp; June 1992.<br /><br />[7] "Multiprocessor Hash-Based Join Algorithms," David J. DeWitt and&nbsp; Robert H. Gerber,&nbsp; Proceedings of the 1985 VLDB Conference, 1985.<br /><br />[8] "The Case for Shared-Nothing," Michael Stonebraker,&nbsp; Data Engineering Bulletin, Vol. 9, No. 1, 1986.<br /><br />[9] "Adaptive Parallel Aggregation Algorithms," Ambuj Shatdal and Jeffrey F. Naughton,&nbsp;&nbsp; Proceedings of the 1995 SIGMOD Conference,&nbsp; 1995.<br /><br />[10] "Pig", Chris Olston, http://research.yahoo.com/project/90<br /><br /><span>[11] "Application of Hash to Data Base Machine and Its
Architecture," Masaru Kitsuregawa, Hidehiko Tanaka, Tohru Moto-Oka,
New Generation Comput. 1(1): 63-74 (1983)</span><br /><span></span><br />]]>
        
    </content>
</entry>

<entry>
    <title>Relational databases for storing and querying RDF</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2008/01/databases-and-rdf.html" />
    <id>tag:www.databasecolumn.com,2008://1.27</id>

    <published>2008-01-09T22:29:21Z</published>
    <updated>2008-03-03T16:15:09Z</updated>

    <summary>The Resource Description Framework (RDF) is a way to describe information about relationships between entities and objects. It was originally developed by the W3C as a way to describe information about resources on the Web. It is intended to be the data model used in the Semantic Web, where web pages contain not just text but also structured records describing the data they contain and the relationships in that data. In this post, Sam Madden and Daniel Abadi discuss RDF and database issues.</summary>
    <author>
        <name>Sam Madden</name>
        
    </author>
    
        <category term="Database miscellaneous" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="abadi" label="Abadi" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="madden" label="Madden" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="rdf" label="RDF" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[The Resource Description Framework (RDF) is a way to describe information about relationships between entities and objects. It was originally developed by the W3C as a way to describe information about resources on the Web. It is intended to be the data model used in the <a href="http://www.w3.org/2001/sw/">Semantic Web</a>, where web pages contain not just text but also structured records describing the data they contain and the relationships in that data.<br /><br />RDF has seen widespread adoption in recent years. For example, the entire MIT library catalog is available in RDF format. More recently, a number of biology researchers have begun to publish their data in RDF, including the <a href="http://dev.isb-sib.ch/projects/uniprot-rdf/">UniProt</a> comprehensive catalog of protein sequence, function, and annotation data.<br /><br /><br /><b>Understanding RDF</b><br /><br />An RDF document consists of a collection of statements of the form subject-property-object. For example, a library database that stores data about authors and books might have statement triples like "User1 has-name 'Sam Madden'", "User1 is-an Author", "User1 wrote Book1", "Book1 is-a Book", "Book1 has-title 'Who ate my cheese?'", etc., as shown in the "Triples Representation" on the top of the figure below.<br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="rdf_table.jpg" src="http://www.databasecolumn.com/images/2008/rdf_table.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="378" width="469" /></span>It should be clear that an RDF document, containing a collection of triples about a group of resources, is a structured database that users may want to browse, search, or query in a number of ways. Building tools that make it possible to do this efficiently is one of the goals of our research. 
In particular, we are interested in the performance of different on-disk storage representations for a collection of triples.<br /><br /><br /><b>Designing tools to handle RDF efficiently</b><br /><br />Our first attempts to do this have focused on leveraging relational database technology. The obvious relational representation of an RDF document is as a table with three columns, which would conventionally be stored as a series of 3-tuples laid out on disk in a row-major format. This representation, however, performs quite poorly for many types of queries. Suppose, for example, we want to find all the authors of the book "Who ate my cheese?".&nbsp; We will first have to find the triple "bookM has-title 'Who ate my cheese?'". We will then have to perform a self join with the triples table to find all of the triples of the form "personN wrote bookM". Finally, for each author, we will have to perform another self join to find triples of the form "personN has-name 'Sam Madden'". <br /><br />Hence, we have been looking at alternative representations that eliminate these self joins (we still expose a logical model of a collection of triples that the user queries, but we transform user queries to apply to our modified physical representation). For example, one possible representation is to store a table where the first column contains the subject, and each additional column corresponds to a particular property. This representation is sometimes called a "property representation", as shown on the bottom of the figure above.&nbsp; Though this representation can have many NULL values if there are a variety of subjects with diverse properties defined, it has the advantage that all of the properties of a given object are now stored together.<br /><br />Our work in this area, "<a href="http://www.vldb.org/conf/2007/papers/research/p411-abadi.pdf">Scalable Semantic Web Data Management Using Vertical Partitioning</a>," appeared in the VLDB Conference in Vienna in September. 
It showed that using a column-oriented database, along with this property representation, allows us to overcome the overhead of representing NULLs, while providing two orders of magnitude better performance than the naive triples representation. This is particularly true when processing queries that must access many triples during execution (e.g., computing the number of books grouped by subject area or institution). Of course, there is a fair amount of subtlety to getting good performance out of such a representation. Have a look at our conference paper for the details!<br /><br /><br /><b>Caveats for column- and row-store databases</b><br /><br />As we've discussed elsewhere in this blog, column-stores can perform worse than row-stores for certain classes of queries. In particular, for lookups of a single record (e.g., all of the information about a particular author), a row-oriented database (using a property representation) may outperform a column-oriented system. This is because it only has to seek to one location on disk to read the data from this record, whereas a column store will have to seek to each column to reconstruct the entire record. <br /><br />There are other situations where neither a row- nor column-oriented property representation is ideal. Imagine, for example, a user browsing an RDF-based Web site containing our library database. During browsing, suppose the user navigates from books or articles, to authors, to related books and articles, and so on. Such browsing queries in a property representation will lead to (slow) self-joins on the property table, just as they did in the triples table. Hence, a more sensible representation for a browsing-oriented database would be to store a given record R near the records the user is likely to navigate to from R. 
This is the topic of our current research in this area.<br /><br /><i>* Editor's note: While this post will show up in the blog as written by Sam Madden, it has two authors: Samuel Madden (MIT) and Daniel Abadi (Yale)</i><br />]]>
        
    </content>
</entry>

<entry>
    <title>The Database Column in 2008: Building on initial success</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2008/01/database-column-in-2008.html" />
    <id>tag:www.databasecolumn.com,2008://1.26</id>

    <published>2008-01-08T20:06:15Z</published>
    <updated>2008-01-08T20:13:09Z</updated>

    <summary>With 2007 now in the books, all of us affiliated with the Database Column blog want to thank you for your readership and thoughtful commentary. There are many topics in the publishing queue, but we want to make sure we are covering topics that matter to readers. We encourage you to send us your questions, comments, and ideas for new topics.</summary>
    <author>
        <name>Admin</name>
        
    </author>
    
        <category term="About Database Column" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="vertica" label="Vertica" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[With 2007 now in the books, all of us affiliated with the Database Column blog want to thank you for your readership and thoughtful commentary. We launched the blog late last year with the goal of generating discussion around cutting-edge database issues, with interest driven by the posts of many of the movers and shakers in the database community. <br /><br />We knew going into this venture that we had to be cognizant of the Vertica and column store influence on the blog, and some readers have provided feedback about their perception of the blog. Rest assured that while all of the main contributors are affiliated with Vertica, we spend a lot of time trying to ensure that the blog does not become a marketing mouthpiece for the company.<br /><br /><br /><b>Readership: Keep checking back and spread the word</b><br /><br />The number of weekly readers for the Database Column blog continues to increase, and we hope that the original, thoughtful content provided by the experts at the blog will continue to attract more readers. More readers mean more interactive discussion (feedback and subsequent posts), inspire the contributors to write more, and, we hope, elevate the overall level of community discussion about database technology. If you know of someone who might be interested in the content here, or you know of other good database-related sites, send them the URL or write a comment.<br /><br /><br /><b>Feedback: Thanks, and a call for more</b><br /><br />On the subject of feedback, we have been excited by the readership's interest in the posts and we appreciate those who have taken the extra time to publish comments. We encourage feedback -- even critical feedback -- and hope the trend continues. We try to respond to many of these comments, and, in fact, some of the comments have inspired full posts in the following weeks.<br /><br />One important change we have made is how we handle comments to the Database Column. 
We experimented with a few different mechanisms for allowing you to comment. Some of the methods required registration -- a process that was not always easy and did not always work. As a result of these early issues, we no longer require registration to post a comment. However, we will review each comment before it goes live. We receive hundreds of blog spam messages a week, so this revised process should enable you to easily comment <i>and</i> avoid an avalanche of spam.<br /><br /><br /><b>Content: Upcoming posts and a call for topics</b><br /><br />There are many topics in the publishing queue, but we want to make sure we are covering topics that matter to readers. What can you expect from us? Upcoming posts will include (these are working titles): "Storing and querying RDF," "What column stores are not good for," and the first in a series of open conversations with Curt Monash on the viability of a one-size-fits-all database approach. That said, we encourage you to send us your questions, comments, and ideas for new topics. We scan all comments for possible topics and we will look at the comments for this post.<br /><br />Thanks again for your interest and contributions. Best wishes for a prosperous 2008!<br /><br />&nbsp;- The Database Column Editors &amp; Contributors<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>To ETL or federate ... that is the question</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2007/12/to-etl-or-federate.html" />
    <id>tag:www.databasecolumn.com,2007://1.25</id>

    <published>2007-12-17T22:58:20Z</published>
    <updated>2008-02-19T19:19:13Z</updated>

    <summary>Enterprises must integrate data in a number of operational systems. But how should they do it? There are two technical approaches: ETL or federate. Michael Stonebraker discusses the pros and cons of each approach with regard to data element &quot;heat,&quot; indexing, resource management, complexity of schema change, contention, timeliness, and mapping, concluding that the ETL approach makes sense in most cases.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="businessintelligence" label="business intelligence" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="etl" label="ETL" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="federation" label="federation" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="indexing" label="indexing" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="oltp" label="OLTP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="schema" label="schema" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[It's a common problem. Enterprises must integrate data in a number of operational systems. But how should they do it? There are two technical approaches:<br /><br /><ul><li><b>Extract, transform, and load (ETL).</b> In this approach, an enterprise sets up a centralized data warehouse and then constructs a global schema for the data of interest. For each operational system, it employs some sort of ETL process to transform data instances into the global schema and then load them into the centralized warehouse.<br /><br /></li><li><b>Federate.</b> As an alternative, enterprises can construct a global schema as described above but leave the data where it resides. Instead of building a central warehouse, they can employ a data federator, such as MetaMatrix or AquaLogic. Queries (and perhaps updates) can be submitted to the federator. In turn, the federator figures out what queries or updates need to be run at each of the operational sites to construct the correct outcome for the submitted commands.<br /></li></ul><br />For the rest of this relatively short post, we will explain the pros and cons of each approach.<br /><br /><br /><b>Data element "heat": Hot data favors ETL</b><br /><br />In the ETL approach, data transformation occurs when a data element is extracted, while in the federation approach transformations occur at query time. If a data element is queried multiple times, it is obviously cheaper to perform the transformation once, thereby favoring the ETL approach. On the other hand, if a data element is never queried or is queried only once, then the federation approach makes more sense. In summary, if a data element is rarely queried (i.e., it is cold), then federation is desirable. In contrast, hot data elements are better served by ETL solutions.<br /><br /><br /><b>Indexing: Federation is harder to optimize</b><br /><br />The data indexing requirements of OLTP are typically quite different from those of data warehouses. 
Hence, in an ETL approach the warehouse workload can be optimized separately from the OLTP workload on different hardware. In the federation approach, a DBA must balance the needs of both workloads in a single database -- a task that will be much more complex than optimizing two separate workloads.<br /><br />In general, the federation approach will have significantly worse performance because the needs of the two environments must be optimized together, rather than separately.<br /><br /><b><br />Resource management: Faster BI query responses for ETL shops</b><br /><br />In a data warehouse, there is a dedicated machine with optimized indexing for BI users. In contrast, the BI user will typically be prioritized behind OLTP transactions in a data federation. This will lead to poor response time for BI queries (i.e., more recommendations to "go out for lunch" while waiting for the result of a query).<br /><br /><br /><b>Complexity of the schema change: ETL approach performs fewer joins</b><br /><br />Most data warehouses implement star or snowflake schemas. In contrast, most OLTP systems utilize non-snowflake schemas. As a result, the global schema is quite different from the various operational schemas. In this case, a single record in the global schema may come from several records in the operational schema. Therefore, a federator must perform this join on every query. In contrast, an ETL system will do the join once at load time. Again, the ETL approach should have much better performance when the schema mapping becomes complex.<br /><br /><br /><b>Contention (concurrency control): Federation contention challenges</b><br /><br />In an ETL system, data elements must be extracted from the operational systems periodically. Once loaded into a central warehouse, they become read-only. Hence, there is essentially no contention for locks in the ETL approach. 
In contrast, the federation approach will mix business intelligence queries and transactions in the operational systems. The result is lock contention, as well as contention for other resources.<br /><br /><br /><b>Timeliness: ETL approaches must deal with out-of-date data issues</b><br /><br />A data warehouse is fundamentally out of date by one-half of the periodicity of the load process. On the other hand, a federator gives up-to-the-second information.<br /><br />To alleviate this disadvantage, some newer warehouse systems, such as Vertica, allow data loading in parallel with querying, a process called "trickle loading."<br /><br /><br /><b>Mapping: Federations can't handle some transformations</b> &nbsp;<br /><br />A common situation is for the operational databases to have customer information, such as customer names. In an ETL approach, whenever a customer datum is encountered, it can be looked up in a steadily growing table containing the mapping from operational system names to global schema names. If a name is not present a new entry can be added. Name mapping is thereby a global operation, supported by a mapping table. In a data federation, name mapping is done on data access. It is difficult to guarantee that the same mapping is applied to each operational system, unless the same table -- discussed above -- is maintained. However, a federator has no facility to perform mappings that require state information. As such, there are some transformations that are very difficult to perform on the fly.<br /><br /><br /><b>Summary: The ETL approach makes sense in most cases</b><br /><br />In summary, virtually all enterprises use the ETL approach for data integration. The data federation market is, in contrast, quite small. The place where I see federations as most viable is when there are many, many data sources (e.g., more than 5,000 sources) and BI users utilize only a small number of them at any given time. 
In this extreme case, the average data element is accessed zero times before it is updated or deleted. In this instance, one is better off leaving the data where it originates. On the other -- more common -- hand, when most data elements get used several times, the ETL approach will continue to be preferred. <br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>The new economics of the BI market</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2007/12/new-economics-of-the-bi-market.html" />
    <id>tag:www.databasecolumn.com,2007://1.24</id>

    <published>2007-12-05T23:48:54Z</published>
    <updated>2008-03-03T16:14:16Z</updated>

    <summary>Consolidation within the Business Intelligence (BI) market continues. After more than a dozen acquisitions made by Business Objects, Cognos, and Hyperion over the past few years, these BI tools/analytics industry leaders were themselves snapped up in a matter of months by SAP, IBM, and Oracle, respectively. But economies of scale enabled by consolidation are just one of the two primary drivers of the new economics of BI. Jerry Held explains how the other driver is the economies of innovation that result from the continuing stream of new entrants.</summary>
    <author>
        <name>Jerry Held</name>
        
    </author>
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database miscellaneous" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="businessintelligence" label="business intelligence" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="consolidation" label="consolidation" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="held" label="Held" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="innovation" label="innovation" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[Consolidation within the Business Intelligence (BI) market continues, affecting the entire BI software stack:<br /><br /><ul><li>After more than a dozen acquisitions made by Business Objects, Cognos, and Hyperion over the past few years, these BI tools/analytics industry leaders were themselves snapped up in a matter of months by SAP, IBM, and Oracle, respectively.<br /><br /></li><li>In the 2005-2006 time frame, the BI/data integration market consolidated as the major DBMS and BI players acquired a host of ETL and data quality solution providers, including Ascential, Sunopsis, FirstLogic, and Acta.<br /><br /></li><li>Earlier in this decade, we witnessed a similar consolidation within the underlying BI/DBMS segment that became dominated by Oracle, IBM, and Microsoft.</li></ul><br />But economies of scale enabled by consolidation are just one of the two primary drivers of the new economics of BI. The other driver is the economies of innovation that result from the continuing stream of new entrants.<br /><br /><b><br />Continuous market consolidation drives economies of scale</b><br /><br />With their broad range of products, the consolidators -- IBM, Oracle, and SAP -- can now offer their customers the simplicity of one-stop shopping, single contracts, quantity discounts, and a single point of support. Their focus on removing redundant administrative costs, gaining efficiencies in all aspects of distribution, and leveraging commonality in a broad range of products should allow these mega-vendors to provide quality products at lower costs while still improving their own profitability -- a situation that is good for both customer and vendor. The fact that none owns a dominant market share also helps ensure competitive pricing. 
These economies of scale should benefit the overall BI marketplace.<br /><br /><br /><b>A constant stream of VC-funded entrants drives economies of innovation</b><br /><br />While convenient, one-stop shopping often does not deliver best-of-breed, innovative solutions. The greatest strengths of the mega-vendors are their extensive product lines -- product lines built on millions of lines of code, sold to huge numbers of customers, and targeted at the core of a very large market. These great assets, however, make it difficult for the mega-vendors to focus on new products for new market needs. As a result, the venture capital world continues to fund a series of new entrants who can provide focus on and agility around ever-evolving market needs. These newcomers provide both technology and business model innovations.<br /><br /><br /><b>Business model innovation examples: Open source and SaaS</b><br /><br />A clear example of business model innovation is the open source movement (e.g., MySQL, Postgres, Jaspersoft, Pentaho, Talend, and others). This movement delivers various components of the BI stack to a much broader part of the market -- a market that previously was forced to make do with file systems and spreadsheets. Because of these open source offerings, millions of customers now enjoy economics that allow them to use much more sophisticated tools at very affordable prices. Another promising -- but fledgling -- area of business model innovation is software as a service (SaaS). The SaaS model, when applied to BI, can change the economic equation in terms of cost and speed of deployment as well as reduction in complexity of operation. 
Larger vendors are experimenting with this model while nimble startups, such as LogiXML, LucidEra, Oco, Dimensional Insight, SeaTab, and OnDemandIQ, are moving ahead without the constraint of existing business models not designed for a SaaS product line.<br /><br /><br /><b>Technology innovation examples: Netezza and Vertica</b><br /><br />By limiting the design focus of a product and building on a new foundation, it is possible to make breakthroughs in price and performance. In the underlying BI/DBMS area, several companies have made dramatic improvements in scalability, performance, price, manageability, and ease of use. Netezza has demonstrated how a special-purpose product can rapidly gain acceptance if it delivers clear value in a focused segment of the market. Vertica has shown how a ground-up software redesign on top of commodity components can provide very impressive performance at dramatically lower prices (Note: Vertica is the patron of this blog and I'm also the chairman of its board. That said, I believe that Vertica is one of the more important examples of breakthrough economics and the company has plenty of examples to substantiate this claim).<br /><br />Relative to the databases from the large players, these new entrants make it possible to analyze orders of magnitude more data in new ways in less time and at much lower cost -- often on an ad-hoc basis and in real time. Examples include:<br /><br /><ul><li>Using these new technologies, telecom companies keep a year's worth or more of call detail records online for analysis. Previously, it was practical to store only 30 days' worth of data.<br /><br /></li><li>Online marketers can now analyze the effectiveness of campaigns in real time and make adjustments mid-stream rather than post-mortem. 
Problems that were intractable because the cost of doing an analysis was more than the value returned are now affordable.<br /></li></ul><br /><b>In summary: A buyer's BI market<br /></b><br />The dramatic consolidation we've seen in the BI market recently is great for customers. It provides many with one-stop shopping, lower license costs, and better product integration from the mega-vendors. It also paves the way for market entrants who create new possibilities for generating business value with solutions that are optimized to meet today's business requirements, such as analyzing large volumes of data, ad-hoc querying, or implementing real-time operational analytics. The result of these innovative companies' arrival is not the displacement of the mega-vendors at the heart of the BI market but rather the extension of the market. Organizations should take advantage of the new economics and consider a portfolio of new and established technologies to maximize return on their BI investments.<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>Haderle responds to commenters regarding RDBMS history</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2007/11/haderle-responds-to-commenters.html" />
    <id>tag:www.databasecolumn.com,2007://1.23</id>

    <published>2007-11-28T20:34:50Z</published>
    <updated>2008-02-21T17:24:49Z</updated>

    <summary>Don Haderle responds to two commenters from his previous post about DBMS history. He notes that &quot;You ask whether it&apos;s possible to render a single implementation of a DBMS that satisfactorily handles all usages (OLTP, analytics, ...) well enough such that we don&apos;t need another. I think not. Existing implementations work ... Someday the economics may change -- but not at present. So one leaves the existing system intact and moves the data to another system (e.g., columnar) to do analysis ...&quot;</summary>
    <author>
        <name>Don Haderle</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database history" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[I noticed a couple of comments in response to my <a href="http://www.databasecolumn.com/2007/11/dbms-origins.html">recent blog post</a>, "Once upon a time ... the origins of today's relational database architectures." I respond to them below.<br /><br />Reader "Dave" commented [note: we have made a few edits to the following comment]:<br /><br /><blockquote>Old-fashioned relational database modeling was designed to limit CPU and I/O for both transaction processing and analytical processing by enforcing structure into the data for efficiency benefit. Modern CRMs have traded structure for flexibility (a generalized database structure) while maintaining OLTP performance to the degradation of analytical performance. OLAP inverts the view to speed seek at the cost of slow crud while maintaining flexibility but introduces I/O and CPU overhead for the second database synchronization and operation.<br /><br />Can we go back to structure and a single engine or is the man-hour cost of implementation not worth the low operating cost of an extract to columnar for analytics that are seldom predefined?<br /></blockquote><br />One of the mantras of relational databases was design flexibility to handle unanticipated usages of the data. <a href="http://en.wikipedia.org/wiki/Edgar_F._Codd">Codd</a> observed that prior databases were fragile with respect to handling additional information or additional usage. It was well known that new fields were always added at the bottom right of the hierarchy, not in the segment/record where the information most deserved to be. This was simply to keep from having to redefine the applications and databases.<br /><br />While the relational design may be flexible, it may not be feasible for execution given performance considerations. Hence, one denormalizes the data design, sacrificing flexibility for performance. 
This is still true today in database design, whether it be relational or something else.<br /><br />You ask whether it's possible to render a single implementation of a DBMS that satisfactorily handles all usages (OLTP, analytics, ...) well enough such that we don't need another. I think not. Existing implementations work. The US Federal Government still maintains tax systems on <a href="http://en.wikipedia.org/wiki/M204">M204</a>, which is a 1970s inverted database. It works and the cost to change the applications that use it is prohibitive. Someday the economics may change -- but not at present. So one leaves the existing system intact and moves the data to another system (e.g., columnar) to do analysis or processing for which M204 is not intended.<br /><br /><br />Reader "Michael M David" commented [note: this comment has been truncated; the full comment is on the <a href="http://www.databasecolumn.com/2007/11/dbms-origins.html">original post's page</a>]:<br /><br /><blockquote>My company has finally been able to turn ANSI SQL's processor into
 a full nonlinear hierarchical processor and the same relational optimizations you mention work to make the hierarchical processing more efficient. We have naturally extended this ANSI SQL hierarchical processing to ANSI SQL transparent native XML hierarchical processing. This shows there are still new areas of ANSI SQL operation to explore and utilize.<br /><br />Making ANSI SQL operate as a hierarchical processor is done by modeling nonlinear hierarchical structures using a series of SQL-92 Left Outer Joins. The Left Outer Join syntax models the nonlinear hierarchical structure requiring processing and its associated semantics specifies how to process the defined hierarchical structure hierarchically. The Left Outer Join is also performing the required hierarchical data preservation, and the insertions of the NULLs allow multiple legs each with varying lengths to be stored correctly and&nbsp; <br />separately in the working rowset and resulting rowset. Relational projection where all columns for a node are not selected causes hierarchical node promotion and node collection.<br /></blockquote><br />It's interesting that we have reused much of our hierarchical brethren's technology inside relational databases, especially with the incorporation of XML, while still preserving the nonprocedural attribute of relational. This is true for IBM relational databases as well as many others. It's good to see you doing this.<br /><br />It's also interesting to see hierarchical databases (IMS, ...) incorporate relational technology and support XML data natively.<br /><br />]]>
        
    </content>
</entry>

<entry>
    <title>Once upon a time ... the origins of today&apos;s relational database architectures</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2007/11/dbms-origins.html" />
    <id>tag:www.databasecolumn.com,2007://1.22</id>

    <published>2007-11-20T23:15:59Z</published>
    <updated>2008-02-19T19:16:03Z</updated>

    <summary>Current relational database management systems are largely built on designs from the 1980s. Back then, computers were expensive and slow relative to today&apos;s systems. The minimization of expensive CPU cycles -- not I/O considerations -- was the driving force in early relational DBMS design. The market sweet spot was transaction...</summary>
    <author>
        <name>Don Haderle</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database history" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="db2" label="DB2" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="haderle" label="Haderle" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="ibm" label="IBM" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[Current relational database management systems are largely built on designs from the 1980s. Back then, computers were expensive and slow relative to today's systems. The minimization of expensive CPU cycles -- not I/O considerations -- was the driving force in early relational DBMS design. The market sweet spot was transaction processing coupled with simple decision support, which was generally satisfied by access on a limited set of attributes (dimensions).<br /><br />Transactions retrieved rows by primary or secondary key and performed modest inserts, updates, and deletes on those rows. Direct access was satisfied with indexes or hashes that provided fast access consuming small chunks of processing with adequate concurrency. Most systems averaged one to two hashes or indexes per table. This spared deletes and inserts from having to maintain many structures -- an effort that consumes considerable processor cycles.<br /><br /><br /><b>Lessons learned at IBM</b><br /><br />At IBM in the early days of DB2, we focused on minimizing processor cycles for transactions and worked with our hardware cousins to optimize the most relevant segments. Relative to prior-generation non-relational DBMSs (IMS, IDMS, DATACOM, ...), DB2 cost two to three times more, largely due to processor consumption. Ignored by many was the fact that one could develop and deploy an application on DB2 significantly faster than on incumbent DBMSs.<br /><br />Applications tended to retrieve entire rows (all attributes) despite the relational directive to retrieve only those attributes of a row that an application really needed. Applications had standard mappings for tables generated by dictionaries that they included in their code. Coding SELECT * and adding predicates was easier than determining the needs of a multipart application that shared the retrieved row with other parts. 
Moving row attributes from storage buffers to application space consumed a great deal of processing, a cost exacerbated in distributed systems where the data was moved from storage buffers to network space to application space. Refinement of this trivial activity and many others in processor pipeline, cache optimization, and instruction set made a huge difference in processor consumption for transactions. Searching was a minor activity; to succeed, one had to focus on an amalgam of row processing (CRUD), serialization, scaling, and availability. By the mid-80s DB2 was cost competitive with incumbent DBMSs.<br /><br />These systems were expanded to address a broader range of business intelligence problems that demanded access to data on unanticipated attributes and analysis of that data by applying functions on the filtered data.<br /><br /><br /><b>Trying to improve search</b><br /><br />To speed up searching in relational systems, we focused on improving full table scans with parallelism, smarter indexing technology, and compression with search on compressed objects. Functions were pushed down in the database to reduce processor cycles and pre-computed via materialized views. However, maintaining the myriad indexes and materialized views is costly in time (processing) and space (storage). This approach has been very successful in instances where access is predictable, but it hasn't satisfied truly random access applications for which full table scans are the only solution. <br /><br />In the 1980s, rows were small (actually the model was an 80-column punched card averaging 20 fields per record) and the number of entities (tables) was small (100 was large). Two- and three-way joins were the norm. Today, the number of attributes per table is in the hundreds, with the most perverse having thousands of attributes. The number of tables in the database has climbed into the multiple thousands. Six- and seven-way joins are common; ten- and twelve-way joins are not extraordinary. 
One only has to track the growth of SAP schemas to witness this change. As a result, searching is significantly more complex given the number of search arguments (attributes) and the number of relationships involved. <br /><br /><b><br />A column architecture emphasizes attributes and relationships</b><br /><br />To address these challenges, it makes sense to design an inverted database where the emphasis is on the attribute lists and the relationships between entities. This is precisely what a columnar database does. The rest is details, which will determine success: compressing for efficiency; linking lists for joins; time stamping data elements to provide historical detail, as well as alleviate pressure on loading and updating; adding all of the relational functionality; etc. <br /><br />The simple fact is that the foundation of this type of database is designed to address the very essence of today's business intelligence needs; viz., searching large collections with a large number of attributes and relationships and applying business functions against those collections.<br /><br />]]>
        
    </content>
</entry>

<entry>
    <title>Database management for &quot;big science&quot; applications</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2007/11/databases-for-big-science.html" />
    <id>tag:www.databasecolumn.com,2007://1.21</id>

    <published>2007-11-06T17:26:49Z</published>
    <updated>2008-02-18T19:35:25Z</updated>

    <summary>&quot;Big science&quot; has problems with current database systems, including maintaining the consistency between the data and metadata, differing requirements for various projects, and the lack of automated lineage support. In this post, Michael Stonebraker discusses the underlying issues, notes previous attempts to solve the problems, and asks scientists to help the database research community develop a better DBMS that can support the needs of big science.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database miscellaneous" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="asap" label="ASAP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="lineage" label="lineage" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="metadata" label="metadata" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="postgres" label="Postgres" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[I recently attended an invitation-only, one-day workshop at the <a href="http://www.slac.stanford.edu/">Stanford Linear Accelerator Center</a>. Attendees included representatives from:<br /><br /><ul><li>The database research community (including me)</li><li>The "big science" community, which has BIG database problems</li><li>Commercial DBMS vendors</li><li>Other "power users" of database technology, including eBay, Yahoo!, and Google<br /></li></ul><br />The point of the workshop was to look for approaches that would solve the DBMS problems of big science in a better way. The conventional wisdom today -- and I am generalizing a bit -- is to store science data in the file system with metadata about the files stored in a relational DBMS. In both astronomy and particle physics, projected data size is well into the <a href="http://en.wikipedia.org/wiki/Petabyte">petabyte</a> range.<br /><br /><br /><b>The top three DBMS issues for big science<br /></b><br />The big science community has a variety of problems with DBMSs, including:<br /><br /><ol><li><b>Consistency of data and metadata.</b> Since metadata is stored separately from the data, the programmer is responsible for keeping the two consistent. This reminds me of the DBMS community in the 1970s -- it lamented the same issue.<br /><b><br /></b></li><li><b>A differing view of DBMS requirements.</b> Science data is stored in the file system because DBMSs don't "do the right thing." However, there seems to be no common statement of what the right thing is. For example, the particle physics folks want time series support for observation data and particle tracks, the astronomy folks want indexing of 3D objects in several coordinate systems, and the remote sensing and astronomy communities want built-in support for multi-dimensional arrays.<br /><br /></li><li><b>No automated lineage support.</b> Support for lineage (provenance) is crucial. 
It is important for scientists to know how any given data set was derived. In other words, they want to keep track of the sequence of processing steps that has previously been applied. As with the first problem, the programmer currently handles this issue manually.<br /></li></ol><br />Obviously, the best solution to these three problems would be to put everything in a next-generation DBMS -- one capable of keeping track of data, metadata, and lineage. Supporting the latter would require all operations on the data to be done inside the DBMS with user-defined functions -- Postgres-style. <br /><br /><br /><b>A previous effort with Postgres failed</b><br /><br />Clearly big science would like somebody else to take over its storage issues. But that has not happened yet. I am reminded of the <a href="http://meteora.ucsd.edu/s2k/s2k_home.html">Sequoia 2000</a> project in the mid 1990s, which I co-led with Jeff Dozier of UC/Santa Barbara while I was at Berkeley. This was a DEC-sponsored collaborative project between computer scientists and earth scientists to build tools and systems for earth scientists. In the database arena, the goal was to use Postgres for storage. But this part of the project failed because:<br /><br /><ul><li>Postgres had no support for big arrays, which was the predominant data type.<br /><br /></li><li>Postgres had no notion of a processing pipeline whereby raw imagery is "cooked" into finished data products. Hence, there was no way for it to automatically keep track of lineage.<br /><br /></li><li>Postgres was not particularly easy to use for the operations earth scientists wanted to use it for, such as coordinate transformations. Hence they did not see the value of a DBMS over custom C or C++ code operating against the file system.<br /></li></ul><br />The Sequoia experience convinced me that big science would not be happy with anything remotely like what was offered in commercial DBMSs. 
That leads to the question of the day: "What do they want?" <br /><br /><br /><b>A call for help: Let the research community help develop a science database</b><br /><br />At <a href="http://www-db.cs.wisc.edu/cidr/cidr2007/index.html">CIDR 2007</a>, some of us reported on a prototype called ASAP that we thought might appeal to the science community. This system proposed a real-time processing pipeline, lineage, and good support for large arrays. In other words, we fixed all the problems we saw in the Sequoia project a decade ago.<br /><br />ASAP is languishing because we cannot find any scientists willing to work with us. The ones we have talked to are typically too busy and don't see the near-term value of collaboration. In a sense they are right -- the value of the collaboration would be to define a good science DBMS that could then be commercialized. But that process and the benefits would probably take at least five years.<br /><br />A better solution requires input from big science on the initial ideas from the DBMS research community. The DBMS research community would be thrilled to try and define these operations, but it needs help from big science. <i>This is a plea to big science for help.</i><br /><br />Where could we start working together? A significant problem in developing a science DBMS will be the definition of a small collection of primitives. Relational DBMSs succeeded in business data processing because essentially all users were willing to use SQL engines based on a single data type (table) and a small set of operations (filter, join, aggregate, etc.). To have a chance to succeed, a science DBMS must also have a small set of data types and operations. A small collection of primitive operations is crucial; otherwise, the run-time system will be hopelessly complex. 
It appears to be a big challenge to come up with one small set of operations, given the diversity of needs I saw at the workshop.<br /><br /><br /><b>Big science can't do it on its own</b><br /><br />The big Web companies have storage issues at least at the scale of big science, and perhaps bigger. Several are "rolling their own" solutions -- or have already done so -- having given up on DBMS technology for their immediate needs. However, these companies appear to have much bigger budgets and more skilled manpower for developing custom DBMS solutions than big science does. Without that money and those resources, it will be crucial for big science to agree on some common standards and then foster their implementation.<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>Database parallelism choices greatly impact scalability</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2007/10/database-parallelism-choices.html" />
    <id>tag:www.databasecolumn.com,2007://1.20</id>

    <published>2007-10-30T13:15:25Z</published>
    <updated>2008-02-19T19:16:45Z</updated>

    <summary>Large databases require the use of parallel computing resources to get good performance. There are several fundamentally different parallel architectures in use today: shared memory, shared disk, and shared nothing. This post examines each approach in terms of how it impacts database scalability.</summary>
    <author>
        <name>Sam Madden</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database history" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="databaseperformance" label="database performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parallelarchitectures" label="parallel architectures" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="shareddisk" label="shared disk" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="sharedmemory" label="shared memory" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="sharednothing" label="shared nothing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[Large databases require the use of parallel computing resources to get good performance. There are several fundamentally different parallel architectures in use today; in this post, Dave DeWitt, Mike Stonebraker, and I review three approaches and reflect on the pros and cons of each. Though these tradeoffs were articulated in the research community twenty years ago, we wanted to revisit these issues to bring readers up to speed before publishing upcoming posts that will discuss recent developments in parallel database design.<br /><br /><b><br />Shared-memory systems don't scale well as the shared bus becomes the bottleneck</b><br /><br />In a shared-memory approach, as implemented on many symmetric multi-processor machines, all of the CPUs share a single memory and a single collection of disks. This approach is relatively easy to program. Complex distributed locking and commit protocols are not needed because the lock manager and buffer pool are both stored in the memory system where they can be easily accessed by all the processors.<br /><br />Unfortunately, shared-memory systems have fundamental scalability limitations, as all I/O and memory requests have to be transferred over the same bus that all of the processors share. This causes the bandwidth of the bus to rapidly become a bottleneck. In addition, shared-memory multiprocessors require complex, customized hardware to keep their L2 data caches consistent. Hence, it is unusual to see shared-memory machines with more than 8 or 16 processors unless they are custom-built from non-commodity parts (and if they are custom-built, they are very expensive). As a result, shared-memory systems don't scale well.<br /><br /><b><br />Shared-disk systems don't scale well either</b><br /><br />Shared-disk systems suffer from similar scalability limitations. In a shared-disk architecture, there are a number of independent processor nodes, each with its own memory. 
These nodes all access a single collection of disks, typically in the form of a storage area network (SAN) system or a network-attached storage (NAS) system. This architecture originated with the Digital Equipment Corporation VAXcluster in the early 1980s, and has been widely used by Sun Microsystems and Hewlett-Packard.<br /><br />Shared-disk architectures have a number of drawbacks that severely limit scalability. First, the interconnection network that connects each of the CPUs to the shared-disk subsystem can become an I/O bottleneck. Second, since there is no pool of memory that is shared by all the processors, there is no obvious place for the lock table or buffer pool to reside. To set locks, one must either centralize the lock manager on one processor or resort to a complex distributed locking protocol. This protocol must use messages to implement in software the same sort of cache-consistency protocol implemented by shared-memory multiprocessors in hardware. Either of these approaches to locking is likely to become a bottleneck as the system is scaled.<br /><br />To make shared-disk technology work better, vendors typically implement a "shared-cache" design. Shared cache works much like shared disk, except that, when a node in a parallel cluster needs to access a disk page, it first checks to see if the page is in its local buffer pool ("cache"). If not, it checks to see if the page is in the cache of any other node in the cluster. If neither of those efforts works, it reads the page from disk.<br /><br />Such a cache appears to work fairly well on OLTP but performs less well for data warehousing workloads. The problem with the shared-cache design is that cache hits are unlikely to happen because warehouse queries are typically answered using sequential scans of the fact table (or via materialized views). Unless the whole fact table fits in the aggregate memory of the cluster, sequential scans do not typically benefit from large amounts of cache. 
Thus, the entire burden of answering such queries is placed on the disk subsystem. As a result, a shared cache just creates overhead and limits scalability.<br /><br />In addition, the same scalability problems that exist in the shared-memory model also occur in the shared-disk architecture. The bus between the disks and the processors will likely become a bottleneck, and resource contention for certain disk blocks, particularly as the number of CPUs increases, can be a problem. To reduce bus contention, customers frequently configure their large clusters with many Fibre Channel controllers (disk buses), but this complicates system design because now administrators must partition data across the disks attached to the different controllers.<br /><br /><br /><b>Shared-nothing scales the best</b><br /><br />In a shared-nothing approach, by contrast, each processor has its own set of disks. Data is "horizontally partitioned" across nodes. Each node has a subset of the rows from each table in the database. Each node is then responsible for processing only the rows on its own disks. Such architectures are especially well suited to the star schema queries present in data warehouse workloads, as only a very limited amount of communication bandwidth is required to join one or more (typically small) dimension tables with the (typically much larger) fact table.<br /><br />In addition, every node maintains its own lock table and buffer pool, eliminating the need for complicated locking and software or hardware consistency mechanisms. Because shared nothing does not typically have nearly as severe bus or resource contention as shared-memory or shared-disk machines, shared nothing can be made to scale to hundreds or even thousands of machines. 
Because of this, it is generally regarded as the best-scaling architecture.<br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="parallel_approaches.jpg" src="http://www.databasecolumn.com/images/2007/parallel_approaches.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="182" width="569" /></span><br /><b>The shared-nothing approach complements other enhancements</b><br /><br />As a closing point, we note that this shared-nothing approach is completely compatible with other advanced database techniques we've discussed on this blog, such as compression and vertical partitioning. Systems that combine all of these techniques are likely to offer the best performance and scalability when compared to more traditional architectures.<br /><div><br /></div>]]>
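        <![CDATA[<br />To make the shared-nothing discussion above a little more concrete, here is a minimal Python sketch of horizontal partitioning for a star-schema query. It is an illustration under stated assumptions, not a description of any particular product: the column names (sale_id, store_id, amount), the region labels, and the helper names partition_rows, local_aggregate, merge_partials, and run_query are all hypothetical, each "node" is simply a Python list, and we assume the small dimension table is copied to every node. Fact rows are hash-partitioned on a key, each node joins and aggregates only its own rows, and a coordinator merges the small partial results.<br /><br /><pre><code>from collections import defaultdict


def partition_rows(rows, key, n_nodes):
    """Horizontally partition rows: each row lands on exactly one node."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # Integer keys are assumed, so hash() is stable across runs.
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes


def local_aggregate(fact_rows, region_by_store):
    """Runs on one node: join its local fact rows with the (assumed
    replicated) dimension table and sum the amounts per region."""
    totals = defaultdict(int)
    for row in fact_rows:
        totals[region_by_store[row["store_id"]]] += row["amount"]
    return dict(totals)


def merge_partials(partials):
    """Runs on the coordinator: only small per-node summaries arrive here."""
    merged = defaultdict(int)
    for partial in partials:
        for region, total in partial.items():
            merged[region] += total
    return dict(merged)


def run_query(fact_rows, region_by_store, n_nodes):
    """Total sales per region, computed node by node and then merged."""
    nodes = partition_rows(fact_rows, "sale_id", n_nodes)
    partials = [local_aggregate(local, region_by_store) for local in nodes]
    return merge_partials(partials)


if __name__ == "__main__":
    facts = [{"sale_id": i, "store_id": i % 3, "amount": 10} for i in range(12)]
    regions = {0: "east", 1: "west", 2: "east"}
    print(run_query(facts, regions, 4))
</code></pre><br />In this sketch the only data that crosses node boundaries is the per-region partial sums, which is the point made above about star-schema joins needing very little communication bandwidth. The answer is the same whether the fact rows are spread over one node or many.<br /><br />]]>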
        
    </content>
</entry>

<entry>
    <title>CPU trends, like disk trends, will favor adoption of column stores</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/2007/10/cpu-trends-like-disk-trends.html" />
    <id>tag:www.databasecolumn.com,2007://1.18</id>

    <published>2007-10-22T17:25:00Z</published>
    <updated>2008-02-18T03:35:33Z</updated>

    <summary>DeWitt and Madden discuss how trends in CPU technology, similar to trends in mass storage technology, will increasingly drive the use of column-store architecture in database systems designed primarily for processing decision support queries. They discuss the impact of slotted pages in row stores and how column stores avoid the CPU performance penalty.</summary>
    <author>
        <name>David DeWitt</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="cpu" label="CPU" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="databaseperformance" label="database performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dewitt" label="DeWitt" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="madden" label="Madden" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="pax" label="PAX" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="rowstores" label="row stores" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[In a <a href="http://www.databasecolumn.com/2007/09/disk-trends.html">recent post</a>, we discussed how mass storage technology trends favor the use of a column-store architecture in database systems designed primarily for processing decision support queries. In this post, Sam Madden and I reflect on why CPU trends may have a similar influence on database design choice.<br /><br /><br /><b>Slotted pages in row stores can slow down CPU performance</b><br /><br />Most row-store architectures use the "slotted pages" concept to place records on disk pages. In addition to some control information, a slotted page can be roughly divided into two major areas as shown in the diagram. <br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="rowstores.jpg" src="http://www.databasecolumn.com/images/2007/rowstores.jpg" class="mt-image-right" style="margin: 0pt 0pt 20px 20px; float: right;" height="360" width="444" /></span> The first area, which generally grows from the front of the page, is used to hold tuples packed tightly one after another on the page. The second area, termed the "slot array," starts at the end of the page and grows toward the front. There is one entry in this array for each tuple on the page. Each entry consists of the offset of the corresponding tuple from the start of the page and the tuple's length.<br /><br />With this layout scheme tuples are addressed by a triple consisting of a file or volume identifier, a page number within the file/volume, and a slot number. Such addresses, commonly referred to as TIDs (for tuple ID) or RIDs (for record ID), provide a level of indirection that turns out to be quite important. In particular, tuples can be put in a different location on the data page while retaining the same slot ID. 
This means it is possible to accommodate insertions, deletions, and updates without affecting indices with entries pointing to tuples on the page.<br /><br />Now consider what happens as a selection predicate is applied to a file of tuples. Two iterators are used. The first iterator iterates through all the pages in the file. For each page returned by this iterator, a second iterator, which encapsulates the page logic, is initialized and then called once for each record on the page. Each time it is called, it advances to the "next" location in the slot array to obtain the offset of the corresponding tuple on the page. <br />Once the offset of the start of the tuple is obtained, the database system's tuple-handling logic computes the offset of the attribute to which the predicate is being applied. These two offsets are then added to the address of the start of the page in memory to get the address of the attribute. Once this address has been computed, the predicate can be applied.<br /><br />In this process, at least two memory accesses per tuple are performed -- one to access the slot array entry and one to access the attribute to perform the selection predicate. For modern CPUs, a critical factor governing how fast an application runs is how often a memory access results in an L2 (data) cache miss. Such misses generally cause modern CPUs to stall for hundreds of cycles (waiting for data to be transferred from the main memory into the cache), giving the application, in this case the database system, the appearance that it is running on a megahertz -- not a gigahertz -- CPU.<br /><br />Cache lines on today's CPUs are typically either 64 bytes (Intel Core 2 Duo) or 128 bytes (Intel Xeon for both L2 and L3 caches). Assume the computer's cache line is 64 bytes. 
Assuming the cache is "cold" -- that is, we haven't recently accessed the current page -- 4-byte slot array entries (2 bytes each of offset and length) will incur an L2 cache miss on every 16th access to the slot array, causing another 64 bytes of memory to be read into the cache. On the other hand, every attribute access will incur an L2&nbsp; cache miss -- unless, of course, tuples are 32 or fewer bytes long, in which case each L2 miss will pull two tuples from memory into the cache. If nulls are implemented using a bit array at the front of the tuple the situation may be worse as an additional memory access may be needed to access the bit array to check whether the attribute is null or not. This access will incur a L2 data cache miss. If the attribute is not null, a second miss will occur, except when the attribute and null bit array fall into the same 64-byte cache line.<br /><br />To make this issue more concrete, assume a 32K-byte page size and 200-byte tuples. Each page will hold about 160 tuples. Scanning each page will incur at least 170 L2 cache misses (10 for scanning the slot array and 160 for scanning the tuples themselves) to process these 160 tuples. Modern processors do implement "prefetching," which detects consecutive memory accesses within a cache line and automatically initiates a load of the next cache line before it is referenced. However, this is only likely to help when accessing the slot array (which is walked sequentially), whereas following pointers from the slot array to tuples causes random access throughout the page that cannot be predicted by the prefetcher.<br /><br /><br /><b>Column stores avoid the slotted page CPU penalty</b><br /><br />Now consider what happens with a column store in which each column of a table is stored in a separate file. To simplify the discussion for now, assume no compression is used and that the attribute to which the predicate is being applied is a 4-byte integer. 
We assume that the column store processes updates, inserts, and deletes by writing them to a separate write-optimized store and periodically merging them into the main data store. This merging process rewrites both the main store and any associated indices. This means that attribute values in the main data store can be "dense packed" one after another and that tuple identifiers independent of page position are not needed in the main data store. Hence, column stores do not need a slot array in the main store.<br /><br />Consider the process of scanning the values on each page to apply the selection predicate. With a 64-byte cache line and 4-byte integer attribute values, every 16th access to an attribute will incur an L2 data cache miss. Processing 160 attributes will incur just 10 misses compared to the 170 misses incurred by a row store -- assuming, again, no cache prefetching. Prefetching is likely to help the column store more than the row store, as the data values themselves are now being accessed sequentially rather than in the random pattern of access generated by following pointers from the slotted page array.<br /><br />The performance difference actually grows with bigger cache lines. For example, with 128-byte cache lines, the row store will incur 165 misses for every 160 tuples processed while the column store will incur only 5 misses.<br /><br />As we mentioned in a previous posting, column stores are highly compressible. Assume that RLE compression is used on a column of integer attributes and that compressing the column yields a compression factor of 10. Each compressed column entry will consist of a (value, position, count) triple and will occupy 10 bytes. With a 64-byte cache line, each miss will bring 6 RLE triples into the cache. Since 160 tuples will, on average, be compressed into 16 RLE triples, processing 160 tuples will incur only 3 cache misses. 
That is more than a 3X reduction compared to an uncompressed column store and a 30X reduction from an uncompressed row store!<br /><br /><br /><b>PAX is an option for row stores, but has not been used</b><br /><br />For the sake of "truth in advertising," there is an alternative way of laying out row store tuples on disk pages called PAX (Partition Attributes Across). Developed by Ailamaki and DeWitt, PAX partitions each data page into <i>n</i> smaller pages, one for each of the <i>n</i> attributes of the tuple (read the paper, <i>Weaving Relations for Cache Performance</i> from the 2001 VLDB Conference <a href="http://www.cs.wisc.edu/multifacet/papers/vldb01_pax.pdf">here</a>). As each tuple is inserted into a page, each of its attributes gets stored in the corresponding minipage. The overall effect is a data page organized as <i>n</i> columns. While such a design has essentially the same cache behavior as a column store, it has the same I/O performance as a traditionally organized row store that uses slotted pages. As far as we are aware, no database vendor has yet adopted PAX for use in its products.<br /><br /><br /><b>Comparing the impact on row and column stores</b><br /><br />So what kind of performance difference does this mean when comparing row to column store architecture? Suppose for each memory access to an attribute, we spend 500 CPU cycles "computing" on it (this is consistent with estimates from recent database papers, such as the above-referenced PAX paper). If memory latency is 100 ns, then an L2 cache miss takes about 300 cycles on a 3 GHz processor. On a Core 2 Duo, accessing a record in L2 cache takes 14 cycles. In a row-oriented database, with 64-byte cache lines, 200-byte tuples with null headers, and 4-byte slot array entries, memory latencies will add about 660 cycles -- that's 2 + 1 in 16 L2 misses, plus 3 L2 accesses -- per record (1160 cycles in total). 
In contrast, in a column store, memory latencies will add only 45 cycles -- 1 in 16 L2 misses plus 2 L2 accesses -- per record (545 total). Therefore, if a database is CPU limited -- which is commonly the case as many production databases have enough disks per CPU to ensure this -- a column store will have more than double the tuple throughput of a row store.<br /><br />In summary, in addition to being more compressible and providing significant I/O savings compared to row stores, column stores are significantly more compatible with the design of modern CPUs and can substantially improve CPU throughput for CPU-bound queries. The result is that as the gap between processor performance and memory latencies grows, column stores will continue to perform better compared to row stores. Additionally, we believe that column stores will also be better able to exploit future CPUs with dozens to hundreds of cores; however, we will reserve that discussion for a future post.<br /><br /><div><br /></div>]]>
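        <![CDATA[<br />For readers who want to check the cache-miss arithmetic in this post, the following Python sketch reproduces it. It is a back-of-the-envelope model only, under the same simplifying assumptions used above (a cold cache, no prefetching, and tuples long enough that every tuple access misses); the function names sequential_misses, row_store_misses, column_store_misses, and cycles_per_record are hypothetical names chosen for this illustration, and the default cycle counts are simply the figures quoted in the post.<br /><br /><pre><code>def sequential_misses(n_items, item_bytes, line_bytes):
    """Cold-cache misses when n_items of item_bytes each are read
    back to back (rounded up to whole cache lines)."""
    total_bytes = n_items * item_bytes
    return (total_bytes + line_bytes - 1) // line_bytes


def row_store_misses(n_tuples, slot_bytes, line_bytes):
    """Slot array walked sequentially, plus one miss per tuple access
    (assumes each tuple is longer than a cache line, as with 200 bytes)."""
    return sequential_misses(n_tuples, slot_bytes, line_bytes) + n_tuples


def column_store_misses(n_values, value_bytes, line_bytes):
    """Dense-packed column values are read sequentially; no slot array."""
    return sequential_misses(n_values, value_bytes, line_bytes)


def cycles_per_record(l2_misses, l2_hits, miss_cycles=300, hit_cycles=14,
                      compute_cycles=500):
    """Memory stall cycles plus compute cycles for one record, using the
    per-miss, per-hit, and compute figures quoted in the post."""
    return l2_misses * miss_cycles + l2_hits * hit_cycles + compute_cycles


if __name__ == "__main__":
    print(row_store_misses(160, 4, 64))      # 170
    print(row_store_misses(160, 4, 128))     # 165
    print(column_store_misses(160, 4, 64))   # 10
    print(column_store_misses(160, 4, 128))  # 5
    print(column_store_misses(16, 10, 64))   # 3 (16 RLE triples of 10 bytes)
    print(cycles_per_record(2 + 1 / 16, 3))  # 1160.75 (row store)
    print(cycles_per_record(1 / 16, 2))      # 546.75 (column store)
</code></pre><br />The unrounded per-record results, 1160.75 and 546.75 cycles, correspond to the rounded figures of 1160 and 545 used in the text, and their ratio is a little over 2.<br /><br />]]>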
        
    </content>
</entry>

</feed>
