Field Fodder -- Compression in Real World Datasets

| | Comments (0)

With database volumes growing exponentially (see this previous post) and CPUs far out performing disks (see this previous post), compression has become a hot topic among database management solutions. Just don't believe everything you hear about compression. Product marketing is ripe with claims of anywhere from 2:1 to 30:1 compression (Note: A ratio of 2:1 is equivalent to 50% compression while 30:1 is equivalent to 96.6% compression). While these ratios may be true for data cooked up in the lab, real world compression rates will vary dramatically depending on the data in your warehouse and how you load and query it.

Compression is very data dependent. It is expected that "your mileage may vary," but with compression the differences may vary by orders of magnitude. Analyzing applications across industries we have found databases are compressible to different extents.


Compression in the Real World

In financial services, stock market trade data for all US exchanges includes 250 days of data and 10,000 instruments each year. At 100 million trades per day (a conservative number), a dataset that records the date and time, instrument, price, and volume will use between 3GB and 10GB (uncompressed) per day depending on the representation (e.g., binary vs. ASCII). By one year, this raw data will be anywhere from 1-3TB. Using an off the shelf Lempel-Ziv (LZ) algorithm may compress it by 2:1.

In the telecommunications space, call detail records include information describing the call path, times, features, and switches. An average record may be 550 bytes long, and a regional telco may record 500 million per day resulting in uncompressed source data of 275GB per day (or 100TB a year). This data, also using off the shelf LZ, compresses by 5:1. These compression factors relative to raw data can vary dramatically depending on whether the data is preprocessed (e.g., encoding long names into ids, changing timestamps with time-zones into GMT, or replacing empty strings with nulls).  

In contrast, column stores often achieve up to three times more compression, reaching factors of 20:1 versus raw data by isolating variability in the data. In a column store, each block contains data of the same attribute and type, and sorted columns guarantee homogeneity even for trickle loads and high cardinality data. This also allows column stores to use more effective algorithms than vanilla LZ. 

Vertica's customers span many industries and applications, so we have been able to assess compression with a large variety of real world datasets. For example, looking at the compressibility of different datasets that customers load into Vertica, we see the following as typical compression ratios, relative to raw ASCII delimited data (e.g., comma separated values):

  • CDR - 8:1 (87%)
  • Consumer Data - 30:1 (96%)
  • Marketing Analytics - 20:1 (95%)
  • Network logging - 60:1 (98%)
  • Switch Level SNMP - 20:1 (95%)
  • Trade and Quote Exchange - 5:1 (80%)
  • Trade Execution Auditing Trails - 10:1 (90%)
  • Weblog and Click-stream - 10:1 (90%)


Achieving Better Compression

In order to achieve compression rates higher than LZ, Vertica implements a variety of homegrown encoding and compression algorithms specifically designed for a column store database. In addition to optimized block layouts and well-known algorithms, such as run length encoding and delta value encoding, Vertica uses optimized integer and floating point compression algorithms and compound encoding that combines algorithms to increase compression and overall system performance.

Note that Vertica not only reads compressed data off of disk but also processes queries on the compressed data, saving memory and CPU bandwidth (see this previous post). This cannot be done when compressing with LZ. 

Applying LZ to any database, a server with enough CPU and a threaded storage subsystem will reduce I/O at the expense of CPU. The result is performance improvements as high as 30% and space savings up to 7:1 compared to raw data. These numbers approximate the best you can get out of a traditional database, storing records as variable sized rows with headers and applying per-block compression. These numbers also assume that data in any given block is homogenous (e.g., same date, instrument, and customer), which may be the case for bulk loaded data but is not necessarily the case when trickle loading. In a trickle load stream that adds only thousands of records per second, most databases add records into any block with free space, resulting in mixed data values and reducing overall compression. In addition, auxiliary structures such as indexes and materialized views add bloat that is more difficult to compress when these structures need to be updated as records are added.

In a recent head-to-head comparison with a competitor, a prospective customer was able to only test 300GB of raw data on a popular row store because its server was limited to 1TB of disk and the additional structures required to achieve adequate performance consumed three times the raw data size. Loading into Vertica the calculated capacity was close to 4TB of raw data on the same 1TB of disk.

You can use the compression ratio in the table to determine what your storage requirements would be using Vertica. With other vendors, you mileage will most certainly vary, but don't forget to factor in indexes, materialized views, and other auxiliary structures for row stores (such as per tuple headers), which are likely to require 2-5x these sizes.


 

Categories

,

Leave a comment

About this Post

This page contains a single post by Omer Trajman published on September 23, 2008 9:15 AM.

Debunking Another Myth: Column-Stores vs. Vertical Partitioning was the previous entry in this blog.

Find recent content on the main index or look in the archives to find all content.