Optimizations around Cold SSTables
Cassandra’s storage architecture is based around large, immutable files called SSTables. SSTables are regularly combined through compaction to evict obsolete data and improve the efficiency of reading from partitions that had been fragmented across multiple files. Historically, Cassandra has indexed all SSTables at the same granularity, using resources proportional to the size of the data stored within each.
However, not all SSTables are equally important. In particular, under many common workloads older SSTables may not be frequently read. For example, if reads skew heavily towards recently written data, new SSTables will serve the vast majority of reads. Many time series data models have a similar effect, where only recently written data is frequently read.
Cassandra 2.0.2 laid the foundation to allow Cassandra to allocate its resources more intelligently by tracking SSTable read rates, which gave operators the information necessary to manually tune things on a per-table basis. 2.0.3 took the next step with improvements to size-tiered compaction based on this data, with automatic resource management coming in 2.1.
Optimizing Size Tiered Compaction for Cold SSTables
Two optimizations around handling cold SSTables were added to Size Tiered Compaction Strategy in Cassandra 2.0.3.
Compact the Hottest SSTables First
The first optimization is to prioritize compaction of the hottest SSTables. That is, if there are multiple sets of SSTables that can be compacted next, the set with the highest collective reads/sec per key will be compacted first. Ideally, this will help to more quickly merge partition fragments that are read frequently. (I'll note that we're currently investigating more sophisticated ways to pick SSTables with a lot of overlapping partitions for compaction.)
Avoid Compacting Cold SSTables
The second optimization tries to avoid compacting cold SSTables at all. A new compaction strategy option, cold_reads_to_omit
, was added to STCS and may be set per table. The value should be a float between 0.0 and 1.0 representing the maximum percentage of reads/sec that the ignored sstables may account for. In other words, as many cold sstables as possible will be ignored during compaction while retaining at least 1 - cold_reads_to_omit
of the total reads/sec for the table.
An example may clarify this. Suppose cold_reads_to_omit
is set to 0.1 and we have four equally sized SSTables with the following read rates: SSTable A has 100 reads/sec, B has 5 reads/sec, C has 4 reads/sec, and D has 3 reads/sec. In total, the SSTables have 112 reads/sec. With cold_reads_to_omit
set to 0.1, we can ignore the coldest SSTables as long as they collectively have less than 11.2 reads/sec. This means that C and D, with only 7 reads/sec total, can be ignored for compaction purposes, while A and B are still candidates.
Starting in Cassandra 2.1, this feature will be enabled by default with a cold_reads_to_omit
value of 0.05. This option is also available in 2.0.3 and later, but is disabled by default. To enable it, you can use cqlsh:
ALTER TABLE mykeyspace.mytable WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'cold_reads_to_omit': 0.05};
When tuning this option, consider what percentage of your reads hit "cold" data and set the value slightly below that. You may want to keep an eye on the "SSTables" column of nodetool cfhistograms
to see if too many reads are spanning a large number of SSTables. If so, consider lowering the value. A value of 0.0 will result in all SSTables being candidates for compaction.
Resizing Index Summaries
In Cassandra 2.1, we have reduced the memory usage of systems with many cold SSTables. Cassandra uses two levels of indexes to lookup the position of rows within SSTables. The first level, usually called the primary index, is a per-SSTable file containing one entry for every partition in the SSTable. The second level, called the index summary, contains a sampling of the entries (by default, every 128th entry) in the primary index. The summary is stored in memory and is used to determine an approximate position to scan the primary index. The index summary was moved off-heap in 2.0, but on large nodes, this memory consumption still adds up.
In 2.1 index summaries will no longer contain a static number of entries. They will be periodically resized, if necessary, to fit within a fixed memory pool size. A new option, index_summary_capacity_in_mb
, has been added to cassandra.yaml
to allow you to set a limit on the total memory consumption of all index summaries. By default, this is set to 5% of the heap size. With an 8 GiB heap, this is enough space to handle roughly 2 billion rows per node at "full sampling", where every 128th row gets an index summary entry. At "minimum sampling", where every 2048th row gets an index summary entry, this is enough space to handle roughly 40 billion rows per node.
If the index summaries are below the configured limit for memory consumption they will be left at their full size. If the limit is hit, the index summaries for the coldest SSTables will be shrunk until memory consumption is below the limit. Because there is a cost associated with resizing summaries, this operation is only performed periodically. By default, that period is every hour, but this is configurable through the index_summary_resize_interval_in_minutes
option in cassandra.yaml
.
It's also worth noting that we plan to make the upper and lower limit for index summary resolution configurable, which may also help to improve performance on reads of hot SSTables.