Database parallelism choices greatly impact scalability

| | Comments (5)
Large databases require the use of parallel computing resources to get good performance. There are several fundamentally different parallel architectures in use today; in this post, Dave DeWitt, Mike Stonebraker, and I review three approaches and reflect on the pros and cons of each. Though these tradeoffs were articulated in the research community twenty years ago, we wanted to revisit these issues to bring readers up to speed before publishing upcoming posts that will discuss recent developments in parallel database design.


Shared-memory systems don't scale well as the shared bus becomes the bottleneck


In a shared-memory approach, as implemented on many symmetric multi-processor machines, all of the CPUs share a single memory and a single collection of disks. This approach is relatively easy to program. Complex distributed locking and commit protocols are not needed because the lock manager and buffer pool are both stored in the memory system where they can be easily accessed by all the processors.

Unfortunately, shared-memory systems have fundamental scalability limitations, as all I/O and memory requests have to be transferred over the same bus that all of the processors share. This causes the bandwidth of the bus to rapidly become a bottleneck. In addition, shared-memory multiprocessors require complex, customized hardware to keep their L2 data caches consistent. Hence, it is unusual to see shared-memory machines of larger than 8 or 16 processors unless they are custom-built from non-commodity parts (and if they are custom-built, they are very expensive). As a result, shared-memory systems don't scale well.


Shared-disk systems don't scale well either


Shared-disk systems suffer from similar scalability limitations. In a shared-disk architecture, there are a number of independent processor nodes, each with its own memory. These nodes all access a single collection of disks, typically in the form of a storage area network (SAN) system or a network-attached storage (NAS) system. This architecture originated with the Digital Equipment Corporation VAXcluster in the early 1980s, and has been widely used by Sun Microsystems and Hewlett-Packard.

Shared-disk architectures have a number of drawbacks that severely limit scalability. First, the interconnection network that connects each of the CPUs to the shared-disk subsystem can become an I/O bottleneck. Second, since there is no pool of memory that is shared by all the processors, there is no obvious place for the lock table or buffer pool to reside. To set locks, one must either centralize the lock manager on one processor or resort to a complex distributed locking protocol. This protocol must use messages to implement in software the same sort of cache-consistency protocol implemented by shared-memory multiprocessors in hardware. Either of these approaches to locking is likely to become a bottleneck as the system is scaled.

To make shared-disk technology work better, vendors typically implement a "shared-cache" design. Shared cache works much like shared disk, except that, when a node in a parallel cluster needs to access a disk page, it first checks to see if the page is in its local buffer pool ("cache"). If not, it checks to see if the page is in the cache of any other node in the cluster. If neither of those efforts works, it reads the page from disk.

Such a cache appears to work fairly well on OLTP but performs less well for data warehousing workloads. The problem with the shared-cache design is that cache hits are unlikely to happen because warehouse queries are typically answered using sequential scans of the fact table (or via materialized views). Unless the whole fact table fits in the aggregate memory of the cluster, sequential scans do not typically benefit from large amounts of cache. Thus, the entire burden of answering such queries is placed on the disk subsystem. As a result, a shared cache just creates overhead and limits scalability.

In addition, the same scalability problems that exist in the shared memory model also occur in the shared-disk architecture. The bus between the disks and the processors will likely become a bottleneck, and resource contention for certain disk blocks, particularly as the number of CPUs increases, can be a problem. To reduce bus contention, customers frequently configure their large clusters with many Fiber channel controllers (disk buses), but this complicates system design because now administrators must partition data across the disks attached to the different controllers.


Shared-nothing scales the best

In a shared-nothing approach, by contrast, each processor has its own set of disks. Data is "horizontally partitioned" across nodes. Each node has a subset of the rows from each table in the database. Each node is then responsible for processing only the rows on its own disks. Such architectures are especially well suited to the star schema queries present in data warehouse workloads, as only a very limited amount of communication bandwidth is required to join one or more (typically small) dimension tables with the (typically much larger) fact table.

In addition, every node maintains its own lock table and buffer pool, eliminating the need for complicated locking and software or hardware consistency mechanisms. Because shared nothing does not typically have nearly as severe bus or resource contention as shared-memory or shared-disk machines, shared nothing can be made to scale to hundreds or even thousands of machines. Because of this, it is generally regarded as the best-scaling architecture.

parallel_approaches.jpg
The shared nothing approach compliments other enhancements

As a closing point, we note that this shared nothing approach is completely compatible with other advanced database techniques we've discussed on this blog, such as compression and vertical partitioning. Systems that combine all of these techniques are likely to offer the best performance and scalability when compared to more traditional architectures.

 

Categories

,

5 Comments

Mark Leith said:

MySQL Cluster is a shared nothing architecture, this should really be clarified in your matrix above.

http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-overview.html

Regards

Mark

(Disclaimer: Senior Support Engineer, MySQL)

I am not quite clear on whether "shared nothing" architecture is something that is supposed to be implemented by the database engine (eg - transparent to the application), or implemented by the software that's the client of the database?

I am very familiar with horizontal data partitioning, and have built highly scalable system still in production that used horizontal data partitioning across separate database instances. It is harder to build such software, as the data layer must be aware of which instance to route the data to and query from; once that solved, however, the performance and resulting scalability of the system can be exceptionally good.

I am very curious to learn more about how database engines implement horizontal "share nothing" architecture in a generic way, that's independent of application needs and its particular DB schema. It seems like a very hard problem to solve.

Like Mark Leith, I believe that your matrix is in error. Microsoft SQL Server has also used shared-nothing clustering (which they refer to as federated databases) for several years - at least since the release of SQL Server 2000.

Cheers,
Chris

Brian Aker said:

Hi!

The definition you propose for shared memory is I believe a bit skewed. Any client server database would be defined by the definition you are proposing and yet you only group a few databases in this category. For instance DB2 is obviously capable of making use of SMP systems.

Your information on MySQL is off as well. For shared disk MyISAM is capable of doing this (though it is rare to see this in production). Also MySQL has two shared nothing solutions at this point.

I realize you are trying to make a start of promoting your own database, Vertica, which is shared nothing, but I think you could do a better job of promoting it by just concentrating on the facts of your own solution.

Cheers,
-Brian

Jason said:

Aren't distributed partitioned views in Microsoft SQL also a variant of Shared Nothing?

Leave a comment

About this Post

This page contains a single post by Sam Madden published on October 30, 2007 9:15 AM.

CPU trends, like disk trends, will favor adoption of column stores was the previous entry in this blog.

Database management for "big science" applications is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.