Cassandra Anti-Patterns: Queues and Queue-like Datasets

Aleksey Yeschenko · April 26, 2013

Deletes in Cassandra

Cassandra uses a log-structured storage engine. Because of this, deletes do not remove rows and columns immediately or in place. Instead, Cassandra writes a special marker, called a tombstone, indicating that a row, column, or range of columns was deleted. These tombstones are kept for at least the period of time defined by the per-table gc_grace_seconds setting; only then can a tombstone be permanently discarded by compaction.
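
For instance, the grace period is set per table, and a delete is just another write. A minimal CQL sketch of both (the events table, the one-day grace period, and the key are illustrative assumptions, not from the original post):

CREATE TABLE events (
    id timeuuid PRIMARY KEY,
    payload blob
) WITH gc_grace_seconds = 86400;  -- default is 864000, i.e. 10 days

-- This does not remove the row in place; it writes a tombstone that
-- compaction may purge only after gc_grace_seconds have elapsed.
DELETE FROM events WHERE id = 9d1cb818-9d7a-11b6-96ba-60c5470cbf0e;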

This scheme makes deletes (and writes in general) very fast, but it isn't free: aside from the obvious RAM and disk overhead of tombstones, you may pay a significant read-performance penalty if you haven't modeled your data well.

Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.

Symptoms of a wrong data model

To illustrate this scenario, let's consider the most extreme case - using Cassandra as a durable queue, a known anti-pattern, e.g.

CREATE TABLE queues (
    name text,
    enqueued_at timeuuid,
    payload blob,
    PRIMARY KEY (name, enqueued_at)
);
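
In this model, enqueuing is an INSERT and dequeuing is a SELECT followed by a column-level DELETE. A hedged sketch of one round trip (the blob literal and the timeuuid are placeholders):

-- Enqueue: append a message to the 'queue-1' row.
INSERT INTO queues (name, enqueued_at, payload)
VALUES ('queue-1', now(), 0x00010203040506070809);

-- Dequeue, step 1: read the oldest live message.
SELECT enqueued_at, payload FROM queues WHERE name = 'queue-1' LIMIT 1;

-- Dequeue, step 2: delete the consumed entry. This writes a tombstone
-- rather than reclaiming space, which is what bites us below.
DELETE FROM queues
 WHERE name = 'queue-1'
   AND enqueued_at = 9d1cb818-9d7a-11b6-96ba-60c5470cbf0e;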

Having enqueued 10000 10-byte messages and then dequeued 9999 of them, one by one, let's peek at the last remaining message using cqlsh with TRACING ON: 

SELECT enqueued_at, payload
  FROM queues
 WHERE name = 'queue-1'
 LIMIT 1;

activity                                   | source    | elapsed
-------------------------------------------+-----------+--------
                        execute_cql3_query | 127.0.0.3 |       0
                         Parsing statement | 127.0.0.3 |      48
                        Peparing statement | 127.0.0.3 |     362
          Message received from /127.0.0.3 | 127.0.0.1 |      42
             Sending message to /127.0.0.1 | 127.0.0.3 |     718
Executing single-partition query on queues | 127.0.0.1 |     145
              Acquiring sstable references | 127.0.0.1 |     158
                 Merging memtable contents | 127.0.0.1 |     189
Merging data from memtables and 0 sstables | 127.0.0.1 |     235
    Read 1 live and 19998 tombstoned cells | 127.0.0.1 |  251102
          Enqueuing response to /127.0.0.3 | 127.0.0.1 |  252976
             Sending message to /127.0.0.3 | 127.0.0.1 |  253052
          Message received from /127.0.0.1 | 127.0.0.3 |  324314
       Processing response from /127.0.0.1 | 127.0.0.3 |  324535
                          Request complete | 127.0.0.3 |  324812

Now even though the whole row was still in memory, the request took more than 300 milliseconds (all the numbers are from a 3-node ccm cluster running on a 2012 MacBook Air).

Why did the query take so long to complete?

A slice query will keep reading columns until one of the following conditions is met (assuming regular, non-reverse order; see the sketch after the list for how each bound maps to CQL):

  • the specified limit of live columns has been read
  • a column beyond the finish column has been read (if specified)
  • all columns in the row have been read
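
To make these bounds concrete, here is a hedged sketch against the queues table above (the literal values are placeholders):

SELECT enqueued_at, payload
  FROM queues
 WHERE name = 'queue-1'
   AND enqueued_at <= 9d1cb818-9d7a-11b6-96ba-60c5470cbf0e  -- finish column
 LIMIT 10;                                                  -- live-column limit
-- With neither bound specified, the scan stops only at the end of the row.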

In the previous scenario Cassandra had to read 9999 tombstones (and create 9999 DeletedColumn objects) before it could get to the only live entry. All the collected tombstones 1) consumed heap and 2) had to be serialised and sent back to the coordinator node along with the single live column.

For comparison, it took less than 1 millisecond for the same query to complete when no column-level tombstones were involved.

The queue example might be extreme, but you'll see the same behaviour when performing slice queries on any row with lots of deleted columns. Also, expiring columns, while more subtle, will have the same effect on slice queries once they expire and become tombstones.
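
For example, a TTL'd write like this hypothetical one starts acting as a tombstone as soon as the TTL lapses:

INSERT INTO queues (name, enqueued_at, payload)
VALUES ('queue-1', now(), 0x00)
USING TTL 3600;  -- expires, and becomes a tombstone, one hour after the write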

Potential workarounds

If you are seeing this pattern (have to read past many deleted columns before getting to the live ones), chances are that you got your data model wrong and must fix it.

For example, consider partitioning data with a heavy churn rate into separate rows and deleting entire rows when you no longer need them. Alternatively, partition it into separate tables and truncate them when they aren't needed anymore.

In other words, if you use column-level deletes (or expiring columns) heavily and also need to perform slice queries over that data, try grouping columns with close 'expiration date' together and getting rid of them in a single move.
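
As a hedged sketch of that idea, one could bucket the queue by day using a composite partition key (the queues_bucketed table and the bucket column are hypothetical names, not from the original post):

CREATE TABLE queues_bucketed (
    name text,
    bucket text,  -- e.g. '2013-04-26'
    enqueued_at timeuuid,
    payload blob,
    PRIMARY KEY ((name, bucket), enqueued_at)
);

-- Once every message in a bucket has been consumed, drop the whole
-- partition in one move: a single partition-level tombstone instead of
-- thousands of cell-level ones, and slice queries never cross buckets.
DELETE FROM queues_bucketed WHERE name = 'queue-1' AND bucket = '2013-04-25';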

When you know where your live columns begin

Note that it's possible to improve on this hypothetical queue scenario. Specifically, if a consumer keeps track of the last entry it processed, it can specify that entry as the start column and thus mitigate the effect of tombstones: Cassandra no longer has to 1) start scanning at the beginning of the row or 2) collect and keep all the irrelevant tombstones in memory.

To show what I mean, let's modify the original example by using the previously consumed entry's key as the start column for the query, i.e.

SELECT enqueued_at, payload
  FROM queues
 WHERE name = 'queue-1'
   AND enqueued_at > 9d1cb818-9d7a-11b6-96ba-60c5470cbf0e
 LIMIT 1;

activity                                   | source    | elapsed
-------------------------------------------+-----------+--------
                        execute_cql3_query | 127.0.0.3 |       0
                         Parsing statement | 127.0.0.3 |      45
                        Peparing statement | 127.0.0.3 |     329
             Sending message to /127.0.0.1 | 127.0.0.3 |     965
          Message received from /127.0.0.3 | 127.0.0.1 |      34
Executing single-partition query on queues | 127.0.0.1 |     339
              Acquiring sstable references | 127.0.0.1 |     355
                 Merging memtable contents | 127.0.0.1 |     461
 Partition index lookup over for sstable 3 | 127.0.0.1 |    1122
Merging data from memtables and 1 sstables | 127.0.0.1 |    2268
        Read 1 live and 0 tombstoned cells | 127.0.0.1 |    4404
          Message received from /127.0.0.1 | 127.0.0.3 |    6109
          Enqueuing response to /127.0.0.3 | 127.0.0.1 |    4492
             Sending message to /127.0.0.3 | 127.0.0.1 |    4606
       Processing response from /127.0.0.1 | 127.0.0.3 |    6608
                          Request complete | 127.0.0.3 |    6901

Despite reading from disk this time, the complete request took about 7 milliseconds. Specifying a start column allowed Cassandra to begin scanning the row close to the actual live column and to skip collecting the tombstones altogether. The difference grows as the row gets larger.

Summary

  • Lots of deleted columns (also expiring columns) and slice queries don't play well together. If you observe this pattern in your cluster, you should correct your data model.
  • If you know where your live data begins, hint Cassandra with a start column to reduce scan times and the number of tombstones to collect.
  • Do not use Cassandra to implement a durable queue.