Apache Cassandra 2.2 – Bootstrapping performance improvements for Leveled Compaction
Background:
If you are using the LeveledCompactionStrategy, there is an exciting feature coming in Apache Cassandra 2.2.
For background information on how leveled compaction works please refer here and here.
When bootstrapping a new node into your cluster the sequence is as follows:
- Calculate range(s) of new node, notify ring of these pending ranges
- Calculate the node(s) that currently own these ranges and will no longer own them once the bootstrap completes
- Stream the data from these nodes to the bootstrapping node
- Join the new node to the ring so it can serve traffic
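As a rough illustration of the first two steps, here is a toy single-token ring in Python. This is a hypothetical sketch with made-up helper names, ignoring vnodes and replication, not Cassandra's actual token metadata code:

```python
from bisect import bisect_left

def build_ring(tokens_by_node):
    """Sorted (token, node) pairs; each node owns (previous_token, its_token]."""
    return sorted((t, n) for n, t in tokens_by_node.items())

def owner(ring, token):
    """Node currently responsible for a token (wrapping around the ring)."""
    tokens = [t for t, _ in ring]
    i = bisect_left(tokens, token) % len(tokens)
    return ring[i][1]

def pending_range(ring, new_token):
    """Range the bootstrapping node will take over, and the node that
    currently owns it (i.e. the node it will stream the data from)."""
    tokens = [t for t, _ in ring]
    i = bisect_left(tokens, new_token) % len(tokens)
    source = ring[i][1]      # current owner; will shrink after bootstrap
    prev = tokens[i - 1]     # predecessor token on the ring
    return (prev, new_token), source
```

For example, in a three-node ring with tokens 100/200/300, a new node at token 150 takes the range (100, 150] away from the node at token 200 and streams that data from it.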
Sounds straightforward, except for one surprise: once the new node finishes bootstrapping, it will have many pending compactions. This negatively affects the read latency of the new node and the cluster, since the purpose of compaction is to organize the data for optimal read performance; if the data isn't organized to the compaction strategy's spec after streaming, read latency will suffer. Operators usually avoid this issue by passing -Dcassandra.join_ring=false when starting the new node, waiting for the bootstrap and the follow-up compactions to finish, and then putting the node into the ring manually with nodetool join.
For SizeTieredCompactionStrategy (STCS) there is no easy workaround for these compactions; we must re-compact the data. The good news is that STCS compactions parallelize easily and the write amplification is low. For LeveledCompactionStrategy (LCS), however, your data must be completely re-leveled! This means the data is re-compacted many times as it moves up the levels, and there is an implicit bottleneck in LCS between Level 0 and Level 1. On nodes with large amounts of data you can see tens of thousands of pending compactions outstanding, which can take days to complete. Ouch!
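To get a feel for the cost, here is a back-of-the-envelope sketch using LCS's default sizing (160 MB SSTables, each level ten times the size of the previous one). The numbers are illustrative estimates, not Cassandra's actual compaction accounting:

```python
# Assumed LCS defaults: 160 MB SSTables, fanout of 10, L1 holds ~10 SSTables.
SSTABLE_MB = 160
FANOUT = 10

def levels_needed(total_mb):
    """Smallest number of levels whose combined capacity holds total_mb."""
    level, capacity = 1, FANOUT * SSTABLE_MB   # L1 capacity: ~1600 MB
    remaining = total_mb
    while True:
        remaining -= capacity
        if remaining <= 0:
            return level
        level += 1
        capacity *= FANOUT                     # each level is 10x larger

def releveling_writes_mb(total_mb):
    """Rough lower bound on bytes rewritten when re-leveling from scratch:
    each byte is rewritten at least once per level it climbs through."""
    return total_mb * levels_needed(total_mb)
```

So a node holding ~16 GB of leveled data needs two levels and rewrites on the order of 32 GB just to re-level; real write amplification is higher still, since compacting into a level also rewrites the overlapping data already there.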
Bootstrapping w/ Levels:
CASSANDRA-7460 addresses the issue of re-leveling bootstrapped data by re-using the level of each source SSTable streamed to the new node. Since a bootstrapping node starts with a clean slate, we can move the original data directly to the correct level without actually compacting it (we know there will be no partition overlaps within each level). This means we can go from days of compacting after adding a new node to zero compactions. Right on!
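The idea can be sketched in a few lines of Python. This is a toy model of the invariant, not the actual Java implementation in CASSANDRA-7460:

```python
# Streamed SSTables carry their source level; because the destination node
# is empty, we can keep that level as long as no two SSTables in the same
# level (above L0) overlap in token range.
def place_streamed_sstables(sstables):
    """sstables: iterable of (level, first_token, last_token) tuples.

    Returns {level: list of (first_token, last_token)}, or raises
    ValueError if a level would contain overlapping token ranges and
    therefore could not be re-used safely.
    """
    levels = {}
    for level, first, last in sstables:
        levels.setdefault(level, []).append((first, last))
    for level, ranges in levels.items():
        if level == 0:
            continue  # L0 is the only level where overlaps are allowed
        ranges.sort()
        prev_last = None
        for first, last in ranges:
            if prev_last is not None and first <= prev_last:
                raise ValueError("token overlap in level %d" % level)
            prev_last = last
    return levels
```

The non-overlap check always passes for a bootstrap stream, since the source node's levels were valid to begin with; that is exactly why the data can land at its original level untouched.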
Let's show this in action by writing data to a single node, then bootstrapping a second node and looking at the compactions that took place. First I started a single node and loaded a few hundred MB of data into a table using LeveledCompactionStrategy. Once loaded and compacted, the levels looked like this:
SSTables in each level on the source node: [0, 10, 17, 0, 0, 0, 0, 0, 0]
I then bootstrapped a new node and looked at the compaction histories on the new node, once with Cassandra 2.1 and once with Cassandra 2.2:
Post-bootstrap compaction history in Cassandra 2.1:
id                                    keyspace_name  columnfamily_name        compacted_at   bytes_in  bytes_out  rows_merged
0316ad60-1064-11e5-ad02-a18c989fb6f2  system         compactions_in_progress  1434045743926  473       80         {2:2}
031267a0-1064-11e5-ad02-a18c989fb6f2  stresscql      ycsb                     1434045743898  5249523   5249523    {1:511}
02e62780-1064-11e5-ad02-a18c989fb6f2  stresscql      ycsb                     1434045743608  92761358  56310515   {1:1933, 2:3548}
ec4aa5f0-1063-11e5-ad02-a18c989fb6f2  system         local                    1434045705679  790       574        {4:1}
SSTables in each level on destination node: [0, 6, 5, 0, 0, 0, 0, 0, 0]
In Cassandra 2.1 you can clearly see that we had to re-compact the streamed data to get it into two levels on the new node.
Post-bootstrap compaction history in Cassandra 2.2:
id                                    keyspace_name  columnfamily_name  compacted_at   bytes_in  bytes_out  rows_merged
81d3a590-1069-11e5-b0c0-a18c989fb6f2  system         local              1434048104041  784       569        {4:1}
SSTables in each level on destination node: [0, 6, 5, 0, 0, 0, 0, 0, 0]
In Cassandra 2.2 we can see there was no compaction and the data is already properly leveled on the new node!
Closing remarks:
By re-using the source SSTable level when bootstrapping a new node, we can greatly improve the MTTB (Mean Time To Bootstrap) for data using the LeveledCompactionStrategy. A couple of other things to note about the process: regular cluster writes that occur during bootstrap still end up in Level 0 upon flush, as usual. Also, the level passed as part of streaming is only used for bootstrap or node replacement, since other streaming scenarios such as repair can't guarantee that there are no overlaps with the existing leveled data.