Once upon a time ... the origins of today's relational database architectures
Current relational database management systems are largely built on designs from the 1980s. Back then, computers were expensive and slow relative to today's systems. The minimization of expensive CPU cycles -- not I/O considerations -- was the driving force in early relational DBMS design. The market sweet spot was transaction processing coupled with simple decision support, which was generally satisfied by access on a limited set of attributes (dimensions).
Transactions retrieved rows by primary or secondary key and performed modest inserts, updates, and deletes on those rows. Direct access was satisfied with indexes or hashes that provided fast access consuming small chunks of processing with adequate concurrency. Most systems averaged one to two hash or indexes per table. This eased the strain on delete and insert having to maintain lots of structures -- an effort that consumes considerable processor cycles.
Lessons learned at IBM
At IBM in the early days of DB2, we focused on minimizing processor cycles for transactions and worked with our hardware cousins to optimize the most relevant segments. Relative to prior generation non-relational DBMS (IMS, IDMS, DATACOM, ...), DB2 cost two-to-three times more, largely due to processor consumption. Ignored by many was the fact that one could develop and deploy an application on DB2 significantly faster than incumbent DBMSs.
Applications tended to retrieve entire rows (all attributes) despite the relational directive to retrieve only those attributes of a row that an application really needed. Applications had standard mappings for tables generated by dictionaries that they included in their code. Coding SELECT * and adding predicates was easier than determining the needs of a multipart application that shared the retrieved row with other parts. Moving row attributes from storage buffers to application space was very processing consumptive and exacerbated in distributed systems where the data was moved from storage buffers to network space to application space. Refinement of this trivial activity and many others in processor pipeline, cache optimization, and instruction set made a huge difference in processor consumption for transactions. Searching was a minor activity; to succeed, one had to focus on an amalgam of row processing (crud), serialization, scaling, and availability. By the mid-80s DB2 was cost competitive with incumbent DBMSs.
These systems were expanded to address a broader range of business intelligence problems that demanded access to data on unanticipated attributes and analysis of that data by applying functions on the filtered data.
Trying to improve search
To speed up searching in relational systems, we focused on improving full table scan with parallelism, smarter indexing technology, and compression with search on compressed objects. Functions were pushed down in the database to reduce processor cycles and pre-computed via materialized views. However, maintaining the myriad indexes and materialized views is costly in time (processing) and space (storage). This approach has been very successful in instances where access is predictable, but it hasn't satisfied truly random access applications for which full table scans are the only solution.
In the 1980s, rows were small (actually the model was an 80 column punched card averaging 20 fields per record) and the number of entities (tables) was small (100 was large). Two- and three-way joins were the norm. Today, the number of attributes per table is in the hundreds with the most perverse having thousands of attributes. The number of tables in the database has climbed into the multiple thousands. Six- and seven-way joins are common; ten- and twelve-way joins are not extraordinary. One only has to track the growth of SAP schemas to witness this change. As a result, searching is significantly more complex given the number of search arguments (attributes) and the number of relationships involved.
A column architecture emphasizes attributes and relationships
To address these challenges, it makes sense to design an inverted database where the emphasis is on the attribute lists and the relationships between entities. This is precisely what a columnar database does. The rest is details which will determine success: compressing for efficiency; linking lists for joins; time stamping data elements to provide historical detail, as well as alleviate pressure on loading and updating; adding all of the relational functionality; etc.
The simple matter is that the foundation of this type of database is designed to address the very essence of today's business intelligence needs; viz., searching large collections with a large number of attributes and relationships and applying business functions against those collections.
Transactions retrieved rows by primary or secondary key and performed modest inserts, updates, and deletes on those rows. Direct access was satisfied with indexes or hashes that provided fast access consuming small chunks of processing with adequate concurrency. Most systems averaged one to two hash or indexes per table. This eased the strain on delete and insert having to maintain lots of structures -- an effort that consumes considerable processor cycles.
Lessons learned at IBM
At IBM in the early days of DB2, we focused on minimizing processor cycles for transactions and worked with our hardware cousins to optimize the most relevant segments. Relative to prior generation non-relational DBMS (IMS, IDMS, DATACOM, ...), DB2 cost two-to-three times more, largely due to processor consumption. Ignored by many was the fact that one could develop and deploy an application on DB2 significantly faster than incumbent DBMSs.
Applications tended to retrieve entire rows (all attributes) despite the relational directive to retrieve only those attributes of a row that an application really needed. Applications had standard mappings for tables generated by dictionaries that they included in their code. Coding SELECT * and adding predicates was easier than determining the needs of a multipart application that shared the retrieved row with other parts. Moving row attributes from storage buffers to application space was very processing consumptive and exacerbated in distributed systems where the data was moved from storage buffers to network space to application space. Refinement of this trivial activity and many others in processor pipeline, cache optimization, and instruction set made a huge difference in processor consumption for transactions. Searching was a minor activity; to succeed, one had to focus on an amalgam of row processing (crud), serialization, scaling, and availability. By the mid-80s DB2 was cost competitive with incumbent DBMSs.
These systems were expanded to address a broader range of business intelligence problems that demanded access to data on unanticipated attributes and analysis of that data by applying functions on the filtered data.
Trying to improve search
To speed up searching in relational systems, we focused on improving full table scan with parallelism, smarter indexing technology, and compression with search on compressed objects. Functions were pushed down in the database to reduce processor cycles and pre-computed via materialized views. However, maintaining the myriad indexes and materialized views is costly in time (processing) and space (storage). This approach has been very successful in instances where access is predictable, but it hasn't satisfied truly random access applications for which full table scans are the only solution.
In the 1980s, rows were small (actually the model was an 80 column punched card averaging 20 fields per record) and the number of entities (tables) was small (100 was large). Two- and three-way joins were the norm. Today, the number of attributes per table is in the hundreds with the most perverse having thousands of attributes. The number of tables in the database has climbed into the multiple thousands. Six- and seven-way joins are common; ten- and twelve-way joins are not extraordinary. One only has to track the growth of SAP schemas to witness this change. As a result, searching is significantly more complex given the number of search arguments (attributes) and the number of relationships involved.
A column architecture emphasizes attributes and relationships
To address these challenges, it makes sense to design an inverted database where the emphasis is on the attribute lists and the relationships between entities. This is precisely what a columnar database does. The rest is details which will determine success: compressing for efficiency; linking lists for joins; time stamping data elements to provide historical detail, as well as alleviate pressure on loading and updating; adding all of the relational functionality; etc.
The simple matter is that the foundation of this type of database is designed to address the very essence of today's business intelligence needs; viz., searching large collections with a large number of attributes and relationships and applying business functions against those collections.
Categories
Database architecture , Database history4 Comments
Leave a comment

Old fashioned Relational database modelling was designed to limit cpu and I/O for both Transaction processing and Analytical processing by enforcing structure into the data for efficiency benefit.
Modern CRM's have traded structure for flexibility (a generalised database structure) while maintaining OLTP performance to the degradation of analytical performance.
OLAP inverts the view to speed seek at the cost of slow crud, while maintaining flexibility. But introduces a I/o and CPU overhead for the second database synchronisation and opperation.
Can we go back to structure and a single engine or is the man hour cost of implementation not worth the low operating cost of an extract to columnar for analytics that are seldom predefined?
Hello Mr Haderle,
My company has finally been able to turn ANSI SQL’s processor into a full nonlinear hierarchical processor and the same relational optimizations you mention work to make the hierarchical processing more efficient. We have naturally extended this ANSI SQL hierarchical processing to ANSI SQL transparent native XML hierarchical processing. This shows there are still new areas of ANSI SQL operation to explore and utilize.
Making ANSI SQL operate as a hierarchical processor is done by modeling nonlinear hierarchical structures using a series of SQL-92 Left Outer Joins. The Left Outer Join syntax models the nonlinear hierarchical structure requiring processing and its associated semantics specifies how to process the defined hierarchical structure hierarchically. The Left Outer Join is also performing the required hierarchical data preservation, and the insertions of the NULLs allow multiple legs each with varying lengths to be stored correctly and separately in the working rowset and resulting rowset. Relational projection where all columns for a node are not selected causes hierarchical node promotion and node collection.
This nonlinear hierarchical processing actually supports extremely sophisticated multi-leg hierarchical (LCA) queries. This Lowest Common Ancestor (LCA) logic is complex and previously required much hierarchical tree walking to perform. On the other hand, ANSI SQL’s LCA processing is carried out automatically by the effect of the Cartesian product processing. This nonprocedural hierarchical processing is navigationless and can handle any number of legs referenced in a single query. This requires nested LCA processing which normally becomes very complicated to specify procedurally and is very error prone. With ANSI SQL, LCA processing is automatically performed correctly regardless of the nesting involved.
One area that did require some optimization outside of the ANSI SQL hierarchical processor was hierarchical access optimization. Our SQLfX middleware that sits around a standard ANSI SQL processor automatically enables it to operate hierarchically and uses it as its hierarchical processing engine. The SQLfX middleware performs the hierarchical optimization based on the hierarchical structure and the active query’s data references and then trims the unneeded paths of the tree. The tree is represented in the input Left Outer Join hierarchical definition. For more info on SQLfX see our SQLfX User doc with examples at: http://www.adatinc.com/images/Verifying_SQLfX_Current.pdf.
What is "linking lists for joins"? Can you provide a link to a description of this? Thanks.
Link lists are fundamental data structures; see this Wikipedia entry for a write up. The columnar database stores columns in lists, which are linked to other columns to form rows, which are linked to other rows in the same or other tables to form joins. There are many methods of linking - positional (viz., the ordinal position of the entry in one list is related to the ordinal position of an entry in another list), symbolic (e.g., early relational systems generated a TID -- Tuple ID -- and related keys to that TID), direct addressable (viz., an address of the related entity), or computational (viz., a link value plugged into an algorithm which provides link to relationships). This is not an exhaustive list, these are frequently used. Each has it's benefit measured in performance and garbage collection.