Designing Systems for the Grid: The Problem with "Retrofitting," Part 1
One of the key features of new data warehouse databases such as Vertica is their ground-up support for distributed, shared-nothing grid computing architectures. Because of scalability and low costs, such architectures are becoming the norm in large enterprises, and because of their scalability requirements, data warehouses are a natural fit in this world.
Some database vendors have attempted to "retrofit" their centralized DBMS designs to work in a distributed world. The basic idea of retrofitting is to reuse as much of the centralized code base as possible in producing a distributed design. The motivation for retrofitting is clear: In building a database system, it is often easier to reuse existing components and adapt them for new uses than it is to build a new system from scratch. This reduces the time-to-market of a product that can claim grid support. But the performance improvements of these retrofitted systems often do not live up to the raw increases in horsepower that grid architectures provide. There are two reasons for this:
We will illustrate this point using the database query optimizer as an illustrative example of how retrofitting strategies can fail. This argument will require some understanding of how centralized query optimizers work. Therefore, we will divide this discussion into two parts. The first installment will provide a background on centralized query optimization; the second installment will show why retrofitting a centralized query optimizer to work on the grid can lead to poor performance when evaluating queries.
Part 1: A Primer on Centralized Query Optimization
In this installment, we present a primer on centralized query optimization.
The purpose of a query optimizer is to produce a cost-effective evaluation plan (or just plan) for any query submitted to the database. The basic strategy used to come up with this plan is largely the same for every query optimizer:
One of the most crucial design decisions that affects the effectiveness of a query optimizer lies in how it limits the size of the space of candidate plans (the search space) that it must consider. Specifically, a query that includes multiple tables in its FROM clause can be evaluated using any of a number of plans that differ only by the order in which these tables are joined. Consider, for example, the SQL query fragment below:
This query must join 4 tables: T1, T2, T3, and T4. The order in which pairs of tables are joined can vary. For example, the binary tree below (hereafter referred to as a join plan) shows one possible join ordering.
The join plan above specifies that the first join to be performed is of T1 and T2 (the "bowtie" icon specifies a join), followed by the result being joined with T3, and the result of this in turn being joined with T4. This join plan has a structure which is left-deep (or equivalently, right-shallow) because the right branch of every join in the plan is a base table.
For the 4-table query above, there are 4! = 24 different left-deep join plans that differ according to which tables correspond to which leaves of the tree (4! = the number of sequences of a set of 4 items). Aside from the left-deep plans, there are 4 other join plan structures that are possible for the query above as illustrated below. Note that each of these join plan structures also has 24 variations given a query with 4 tables (by permuting the tables at the leaves), so in all, there are 5 * 24 = 120 join plans for an optimizer to consider for this query.

In general, given n tables to be joined, there are:
possible join plan structures(1), and for each join plan structure, n! possible join plans, giving a total of:
join plans that an optimizer could consider. As the number of tables in a query grows, the number of join plans to consider quickly becomes infeasible. For example, whereas a query with 4 tables requires consideration of 120 join plans, a query with 5 tables requires consideration of 151,200 join plans, a query with 6 tables requires consideration of 3,991,680 join plans, and so on.
To cope with this enormous search space, all query optimizers must somehow limit the set of join plans considered. IBM's System R (from the late 1970s) first introduced the idea of limiting the search space to the set of join plans with a left-deep join plan structure because left-deep plans ensure that every binary join is performed with at least one participant table on disk, thereby ensuring that a join operator can produce output incrementally as its input data arrives (pipelining). The left-deep restriction reduces the number of join plans to consider for a query of n tables to n!, and dynamic programming techniques can be used to find the "best" query plan in this space in exponential time. In practice, this is a reasonable amount of time to process queries consisting of roughly 30 tables or fewer (YMMV), and thus, this heuristic is still used to narrow the search space of most commercial DBMS.
Of course, there are many other challenges in designing an effective query optimizer aside from managing the search space, including proper choices of access methods and indexes, query unnesting, etc. But in the next installment of this blog, I will show how the typical retrofitted query optimizer determines its search space for plans that apply to the grid and how the resulting optimizer can fail to produce appropriate plans.
Part 2 will be available next week . . .
Some database vendors have attempted to "retrofit" their centralized DBMS designs to work in a distributed world. The basic idea of retrofitting is to reuse as much of the centralized code base as possible in producing a distributed design. The motivation for retrofitting is clear: In building a database system, it is often easier to reuse existing components and adapt them for new uses than it is to build a new system from scratch. This reduces the time-to-market of a product that can claim grid support. But the performance improvements of these retrofitted systems often do not live up to the raw increases in horsepower that grid architectures provide. There are two reasons for this:
- Code that gets reused in a retrofitted system is often brittle as a result of many years of patchwork. To change this code introduces potential instabilities, and thus, there is a natural desire to instead treat such code as "black boxes."
- Black box code was often designed with assumptions of a centralized architecture, and these assumptions may constrain performance when executed over a distributed system over which the assumptions do not hold.
We will illustrate this point using the database query optimizer as an illustrative example of how retrofitting strategies can fail. This argument will require some understanding of how centralized query optimizers work. Therefore, we will divide this discussion into two parts. The first installment will provide a background on centralized query optimization; the second installment will show why retrofitting a centralized query optimizer to work on the grid can lead to poor performance when evaluating queries.
Part 1: A Primer on Centralized Query Optimization
In this installment, we present a primer on centralized query optimization.
The purpose of a query optimizer is to produce a cost-effective evaluation plan (or just plan) for any query submitted to the database. The basic strategy used to come up with this plan is largely the same for every query optimizer:
a) It first formulates a set of candidate plans that could be used to evaluate the query
b) It then applies a cost model to predict the execution time (cost) of each of the candidate plans. A cost model consists of a set of formulas that specify the sizes of intermediate query results, and the cost (e.g., time) required to produce them. For example, a simplistic cost model measures cost as the number of disk reads required to evaluate the query, with the idea that queries that perform the fewest disk reads will execute in the least time.
c) Upon evaluating the cost of each candidate plan, the query optimizer then selects the plan with least cost.
One of the most crucial design decisions that affects the effectiveness of a query optimizer lies in how it limits the size of the space of candidate plans (the search space) that it must consider. Specifically, a query that includes multiple tables in its FROM clause can be evaluated using any of a number of plans that differ only by the order in which these tables are joined. Consider, for example, the SQL query fragment below:
SELECT *
FROM T1, T2, T3, T4
WHERE ...
This query must join 4 tables: T1, T2, T3, and T4. The order in which pairs of tables are joined can vary. For example, the binary tree below (hereafter referred to as a join plan) shows one possible join ordering.
The join plan above specifies that the first join to be performed is of T1 and T2 (the "bowtie" icon specifies a join), followed by the result being joined with T3, and the result of this in turn being joined with T4. This join plan has a structure which is left-deep (or equivalently, right-shallow) because the right branch of every join in the plan is a base table. For the 4-table query above, there are 4! = 24 different left-deep join plans that differ according to which tables correspond to which leaves of the tree (4! = the number of sequences of a set of 4 items). Aside from the left-deep plans, there are 4 other join plan structures that are possible for the query above as illustrated below. Note that each of these join plan structures also has 24 variations given a query with 4 tables (by permuting the tables at the leaves), so in all, there are 5 * 24 = 120 join plans for an optimizer to consider for this query.

In general, given n tables to be joined, there are:
(2(n-1))!
---------------
n! * (n-1)!
---------------
n! * (n-1)!
possible join plan structures(1), and for each join plan structure, n! possible join plans, giving a total of:
(2(n-1))!
---------------
(n-1)!
---------------
(n-1)!
join plans that an optimizer could consider. As the number of tables in a query grows, the number of join plans to consider quickly becomes infeasible. For example, whereas a query with 4 tables requires consideration of 120 join plans, a query with 5 tables requires consideration of 151,200 join plans, a query with 6 tables requires consideration of 3,991,680 join plans, and so on.
To cope with this enormous search space, all query optimizers must somehow limit the set of join plans considered. IBM's System R (from the late 1970s) first introduced the idea of limiting the search space to the set of join plans with a left-deep join plan structure because left-deep plans ensure that every binary join is performed with at least one participant table on disk, thereby ensuring that a join operator can produce output incrementally as its input data arrives (pipelining). The left-deep restriction reduces the number of join plans to consider for a query of n tables to n!, and dynamic programming techniques can be used to find the "best" query plan in this space in exponential time. In practice, this is a reasonable amount of time to process queries consisting of roughly 30 tables or fewer (YMMV), and thus, this heuristic is still used to narrow the search space of most commercial DBMS.
Of course, there are many other challenges in designing an effective query optimizer aside from managing the search space, including proper choices of access methods and indexes, query unnesting, etc. But in the next installment of this blog, I will show how the typical retrofitted query optimizer determines its search space for plans that apply to the grid and how the resulting optimizer can fail to produce appropriate plans.
(1) This is the known as the nth Catalan number, which specifies (among other things) the number of binary tree "shapes" consisting of n leaves.
Part 2 will be available next week . . .

Leave a comment