CompanyNovember 17, 2014

What’s Coming to Cassandra in 3.0: Improved Hint Storage and Delivery

What’s Coming to Cassandra in 3.0: Improved Hint Storage and Delivery

What is Hinted Handoff?

Hinted handoff is an important part of the Cassandra write path - it allows us to reduce the inconsistency window caused by temporary node unavailability periods. A comprehensive blog post by Jonathan Ellis explains why HH is useful and how modern HH works - in two years that passed since then the implementation at the high level has basically remained the same.

Current Hinted Handoff Implementation

In Cassandra 2.1 and below, hints are stored in a regular Cassandra table in the local system keyspace - system.hints. Here is its schema:

CREATE TABLE system.hints (
    target_id uuid,
    hint_id timeuuid,
    message_version int,
    mutation blob,
    PRIMARY KEY ((target_id), hint_id, message_version)
) WITH COMPACT STORAGE;

target_id - target node’s unique host id - is the partition key here; hint_id is a unique Class 1 UUID; message_version stores the Cassandra version used to serialise the mutation, and mutation is used to store the actual serialised Mutation that couldn’t be delivered to the node - the hint to replay. Partitioning by the host id means that all the hints for a particular node belong to one partition, internally; clustering by the time-based hint id means that new hints get appended to the end of that logical partition. To avoid resurrecting deleted rows during hints replay, all the entries have TTL set to the smallest gc_grace_seconds of all the tables in the hinted mutation.

In this implementation saving a hint for an unresponsive node is as simple as doing an INSERT INTO system.hints .. query, internally. And replay isn’t difficult, either. To deliver hints to a recovered node, Cassandra simply scans the partition with target_id = the node’s host id (in a paginated fashion), deserialises each mutation from the mutation blob, then sends the mutation to the node, and deletes the delivered hint from the hints table (by simply writing a tombstone). If all the hints have been successfully delivered, we do flush, and run a major compaction, to get rid of all the accumulated tombstones and leave no trace of the delivered hints *.

It’s a simple mechanism. And it allows us to reuse what we already have - our battle tested storage engine - for hints storage and delivery. It lets us reuse streaming to simply stream the node’s hints to a different one when we decommission it, too. But this reusability and simplicity comes with a price.

Hinted Handoff as Queue Anti-Pattern

If you look at the current storage/replay mechanism carefully, you’ll notice that it’s based on perhaps the worst Cassandra anti-pattern there is - the queue!

You can read my previous blog post for more details on how queues and Cassandra don’t get along well.

There are two ways that hints replay can hit a large number of tombstones and blow up:

  1. During previous replay for the target node some of the mutations timed out, and delivery got aborted, so that post-delivery compaction wasn’t triggered.
  2. The target node has been down for a while, and it ended up accumulating a large number of expired hints (remember - all the hints are TTLd, to preserve correctness).

You can read CASSANDRA–6666 and CASSANDRA–6998 tickets to gain deeper understanding of both scenarios.

The 3.0 Way

In order to fix that and reduce the overall overhead surrounding hints, we’ve decided to completely rewrite the implementation of hints in Cassandra 3.0.

Starting with 3.0, Cassandra will simply store hints in flat files, bypassing the storage engine altogether.

As noticed previously, Cassandra’s storage engine introduces a lot of overhead for something as trivial as storage of hints: they are immutable, write-once data, that we only read once and then discard after replay. We don’t care about the order we write them in, and we ultimately don’t care about partition-level isolation when writing to system.hints.

Storing hints in flat files - a-la per-node commit logs - allows us to avoid that overhead:

  • we no longer need to go through the memtable and the commit log on the write path
  • we no longer need to perform IO - and - CPU - consuming compaction for hints
  • we no longer suffer from contention on huge system.hints partitions when a node is down - which can be a very serious issue

Saving a hint in Cassandra 3.0 is as trivial as appending the serialised mutation to a file.

Hints Replay in 3.0

Having hints in regular flat (segmented) files allows us to simplify optimise replay process as well.

Starting with 3.0, replay no longer operates on individual hints, competing for MUTATION stage with other writes. Instead, we will stream hints in bulk, segment by segment, and let the receiving node apply them locally. After streaming a segment to the target node, the replaying node can simply discard the replayed segment - by removing the file - with no tombstones involved.

  • CASSANDRA–6998 fix, included in Cassandra 2.0.11 and 2.1.1, drastically improves the situation around the tombstones during replay. Still, only a comprehensive rewrite can deal with the overhead of compaction and properly cure CASSANDRA–7545.

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.