September 19, 2011

What’s new in Cassandra 1.0: Compression

Cassandra 1.0 introduces support for data compression on a per-ColumnFamily basis, one of the most-requested features since the project started. Compression maximizes the storage capacity of your Cassandra nodes by reducing the volume of data on disk. In addition to the space-saving benefits, compression also reduces disk I/O, particularly for read-dominated workloads.

Compression benefits

Besides reducing data size, compression typically improves both read and write performance. Cassandra quickly finds the location of a row through the SSTable index and decompresses only the relevant row chunks. As a result, compression improves read performance not only by allowing more of the data set to fit in memory, but also for workloads where the hot data set does not fit in memory.
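The chunk-level read path described above can be illustrated with a small Python sketch. It is not Cassandra's implementation: it uses zlib from the Python standard library in place of Snappy, a computed byte offset in place of the SSTable index, and hypothetical helper names (compress_chunks, read_range) and row data.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # mirrors the 64 KB default chunk length


def compress_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Compress each fixed-size chunk of the uncompressed data independently."""
    return [
        zlib.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def read_range(chunks: list[bytes], offset: int, length: int,
               chunk_size: int = CHUNK_SIZE) -> tuple[bytes, int]:
    """Return `length` bytes at `offset`, plus how many chunks were decompressed."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    # Only the chunks that overlap the requested range are decompressed.
    plain = b"".join(zlib.decompress(chunks[i]) for i in range(first, last + 1))
    start = offset - first * chunk_size
    return plain[start:start + length], last - first + 1


# Hypothetical data file: 10,000 fixed-width rows of 100 bytes each.
rows = [
    f"user{i:06d},user{i:06d}@example.com,TX".encode().ljust(100)
    for i in range(10_000)
]
data = b"".join(rows)
chunks = compress_chunks(data)

# A real index lookup would supply the row offset; here it is computed directly.
row, touched = read_range(chunks, offset=7_500 * 100, length=100)
print(row == rows[7_500], touched, len(chunks))  # prints: True 1 16
```

Reading one row touches a single chunk out of sixteen, so the cost of decompression stays proportional to the chunk size rather than to the size of the whole file.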

Unlike in traditional databases, write performance is not negatively impacted by compression in Cassandra. Writes on compressed tables can in fact show up to a 10 percent performance improvement. In traditional relational databases, writes require overwriting existing data files on disk. The database has to locate the relevant pages on disk, decompress them, overwrite the relevant data, and then compress them again, which is an expensive operation in both CPU cycles and disk I/O.

Because Cassandra SSTable data files are immutable (they are not written to again after they have been flushed to disk), there is no recompression cycle necessary in order to process writes. SSTables are only compressed once, when they are written to disk.

Overall, we are seeing the following results from enabling compression, depending on the data characteristics:

  • 2x-4x reduction in data size
  • 25-35% performance improvement on reads
  • 5-10% performance improvement on writes

When to use compression

Compression is best suited for ColumnFamilies where there are many rows, with each row having the same columns, or at least many columns in common. For example, a ColumnFamily containing user data such as username, email, etc., would be a good candidate for compression. The more similar the data across rows, the greater the compression ratio will be, and the larger the gain in read performance.

Compression is not as good a fit for ColumnFamilies where each row has a different set of columns, or where there are just a few very wide rows. Dynamic column families such as these will not yield good compression ratios.
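To get a feel for how much data similarity matters, the following Python sketch compares compression ratios for two synthetic inputs. It is only an illustration under stated assumptions: zlib stands in for Cassandra's compressors, the row layout is hypothetical, and random bytes are used as a deliberately extreme stand-in for rows that share nothing (real dynamic column families will usually fall somewhere in between).

```python
import os
import zlib


def ratio(data: bytes) -> float:
    """Uncompressed size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data))


# Rows that all carry the same column names, as in a user-data column family.
similar_rows = b"".join(
    f"name=user{i};email=user{i}@example.com;state=TX;gender=F;birth_year=1980;".encode()
    for i in range(5_000)
)

# Worst-case stand-in for rows with nothing in common: random bytes, same size.
dissimilar_rows = os.urandom(len(similar_rows))

print(f"similar rows:    {ratio(similar_rows):.1f}x")
print(f"dissimilar rows: {ratio(dissimilar_rows):.1f}x")
```

The repeated column names and value patterns in the first input give the compressor plenty of redundancy to exploit, while the second input barely compresses at all.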

Configuring compression on a ColumnFamily

When you create or update a column family, you can choose to make it a compressed column family by specifying the following storage properties:

  • compression_options: this is a container property for setting compression options on a column family. The compression_options property contains the following options:
    • sstable_compression: specifies the compression algorithm to use when compressing SSTable files. Cassandra supports two built-in compression classes: SnappyCompressor (Snappy compression library) and DeflateCompressor (Java zip implementation). Snappy offers faster compression and decompression, while the Java zip implementation offers better compression ratios. Choosing the right one depends on whether you prioritize space savings or read performance. For read-heavy workloads, Snappy compression is recommended. Developers can also implement custom compression classes using the org.apache.cassandra.io.compress.ICompressor interface.
    • chunk_length_kb: sets the compression chunk size in kilobytes. The default value (64) is a good middle ground for compressing column families with either wide rows or skinny rows. With wide rows, it allows reading a 64 KB slice of column data without decompressing the entire row. For skinny rows, although you may still end up decompressing more data than requested, it is a good trade-off between maximizing the compression ratio and minimizing the overhead of decompressing more data than is needed to access a requested row. The compression chunk size can be adjusted to account for read/write access patterns (how much data is typically requested at once) and the average size of rows in the column family.
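The trade-off behind chunk_length_kb can be explored with a rough Python sketch. It is not a benchmark of Cassandra: zlib stands in for the built-in compressors, the skinny-row data is synthetic, and compressed_size is a hypothetical helper. It compresses the same data with different chunk sizes and reports the resulting ratio next to the amount of data a single-row read may have to decompress.

```python
import zlib

# Hypothetical skinny rows of roughly 70 bytes each.
data = b"".join(
    f"name=user{i};email=user{i}@example.com;state=TX;birth_year={1950 + i % 50};".encode()
    for i in range(20_000)
)


def compressed_size(data: bytes, chunk_kb: int) -> int:
    """Total compressed size when each chunk of `chunk_kb` KB is compressed on its own."""
    chunk = chunk_kb * 1024
    return sum(
        len(zlib.compress(data[start:start + chunk]))
        for start in range(0, len(data), chunk)
    )


for chunk_kb in (4, 64, 256):
    size = compressed_size(data, chunk_kb)
    print(f"chunk_length_kb={chunk_kb:>3}: ratio {len(data) / size:.2f}x, "
          f"up to {chunk_kb} KB decompressed per single-row read")
```

Smaller chunks reduce the work per read but give the compressor less context, so the overall ratio drops. Larger chunks do the opposite, which is why the right value depends on row size and access pattern.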

You can enable compression when you create a new column family, or update an existing column family to add compression later on. When you add compression to an existing column family, existing SSTables on disk are not compressed immediately. Any new SSTables that are created will be compressed, and existing SSTables will be compressed during the normal Cassandra compaction process. (If necessary, you can force existing SSTables to be rewritten and compressed by using the nodetool scrub command.)

For example, to create a new column family with compression enabled using the Cassandra CLI, you would do the following:


[default@demo] CREATE COLUMN FAMILY users
WITH comparator=UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: name, validation_class: UTF8Type}
{column_name: email, validation_class: UTF8Type}
{column_name: state, validation_class: UTF8Type}
{column_name: gender, validation_class: UTF8Type}
{column_name: birth_year, validation_class: LongType}
]
AND compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};

Conclusion

Compression in Cassandra 1.0 is an easy way to reduce storage volume requirements while increasing performance. Compression can be easily added to existing ColumnFamilies after an upgrade, and the implementation allows power users to tweak chunk sizes for maximum benefit.

Previously

  • What's new in Cassandra 0.8
  • What's new in Cassandra 0.7
  • What's new in Cassandra 0.6
