TechnologyNovember 13, 2012

Request tracing in Cassandra 1.2

Request tracing in Cassandra 1.2

Cassandra long ago moved beyond its original design of limiting reads to primary key lookups.

This additional power does bring with it some additional complexity. To make it easier to understand what Cassandra is doing under the hood, we've added request tracing to Cassandra 1.2.

Tracing an insert

As a simple example, consider this example table:

 

CREATE TABLE test (
  a int PRIMARY KEY,
  b text
);

INSERT INTO test (a, b) VALUES (1, 'example');
INSERT INTO test (a, b) VALUES (2, 'example');
INSERT INTO test (a, b) VALUES (3, 'example');
INSERT INTO test (a, b) VALUES (4, 'example');
INSERT INTO test (a, b) VALUES (5, 'example');
INSERT INTO test (a, b) VALUES (6, 'example');
INSERT INTO test (a, b) VALUES (7, 'example');
INSERT INTO test (a, b) VALUES (8, 'example');
INSERT INTO test (a, b) VALUES (9, 'example');
INSERT INTO test (a, b) VALUES (10, 'example');

 

Now we'll enable tracing and insert a row. (These numbers are from a ccm cluster with all three nodes on my macbook air on a cold JVM.)

 

cqlsh> tracing on;
Now tracing requests.

cqlsh:foo> INSERT INTO test (a, b) VALUES (1, 'example');
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

 activity                            | timestamp    | source    | source_elapsed
-------------------------------------+--------------+-----------+----------------
                  execute_cql3_query | 00:02:37,015 | 127.0.0.1 |              0
                   Parsing statement | 00:02:37,015 | 127.0.0.1 |             81
                 Preparing statement | 00:02:37,015 | 127.0.0.1 |            273
   Determining replicas for mutation | 00:02:37,015 | 127.0.0.1 |            540
       Sending message to /127.0.0.2 | 00:02:37,015 | 127.0.0.1 |            779

   Messsage received from /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |             63
                   Applying mutation | 00:02:37,016 | 127.0.0.2 |            220
                Acquiring switchLock | 00:02:37,016 | 127.0.0.2 |            250
              Appending to commitlog | 00:02:37,016 | 127.0.0.2 |            277
                  Adding to memtable | 00:02:37,016 | 127.0.0.2 |            378
    Enqueuing response to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |            710
       Sending message to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |            888

   Messsage received from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 |           2334
 Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 |           2550
                    Request complete | 00:02:37,017 | 127.0.0.1 |           2581

 

You can see that there are three distinct stages to a simple insert like this:

  1. The coordinator figures out which node(s) this row should be replicated to
  2. The replica (in yellow) appends the row to the commitlog, then adds it to the memtable
  3. The coordinator receives a confirmation from the replica and tells the client that the request was successful

Tracing a sequential scan

Let's look at a more complicated example. Here I've pre-inserted ten rows like the above one.

 

cqlsh:foo> SELECT * FROM test;

 activity                                                                                       | timestamp    | source    | source_elapsed
------------------------------------------------------------------------------------------------+--------------+-----------+----------------
                                                                             execute_cql3_query | 16:36:52,749 | 127.0.0.1 |              0
                                                                              Parsing statement | 16:36:52,749 | 127.0.0.1 |             56
                                                                             Peparing statement | 16:36:52,749 | 127.0.0.1 |            259
                                                                  Determining replicas to query | 16:36:52,749 | 127.0.0.1 |            540

                   Executing seq scan across 0 sstables for [min(-9223372036854775808), max(0)] | 16:36:52,750 | 127.0.0.1 |           1522
                                                                   Scanned 6 rows and matched 6 | 16:36:52,751 | 127.0.0.1 |           2062

                                                                  Sending message to /127.0.0.2 | 16:36:52,751 | 127.0.0.1 |           2292

                                                               Message received from /127.0.0.1 | 16:36:52,751 | 127.0.0.2 |             37
                    Executing seq scan across 0 sstables for (max(0), max(3074457345618258602)] | 16:36:52,752 | 127.0.0.2 |            663
                                                                   Scanned 2 rows and matched 2 | 16:36:52,755 | 127.0.0.2 |           3952
                                                               Enqueuing response to /127.0.0.1 | 16:36:52,755 | 127.0.0.2 |           4001
                                                                  Sending message to /127.0.0.1 | 16:36:52,756 | 127.0.0.2 |           4366

                                                               Message received from /127.0.0.2 | 16:36:52,757 | 127.0.0.1 |           8019
                                                            Processing response from /127.0.0.2 | 16:36:52,757 | 127.0.0.1 |           8544
                                                                  Sending message to /127.0.0.3 | 16:36:52,758 | 127.0.0.1 |           9117

                                                               Message received from /127.0.0.1 | 16:36:52,759 | 127.0.0.3 |             36
  Executing seq scan across 0 sstables for (max(3074457345618258602), max(6148914691236517204)] | 16:36:52,764 | 127.0.0.3 |           4891
                                                                   Scanned 1 rows and matched 1 | 16:36:52,765 | 127.0.0.3 |           6418
                                                               Enqueuing response to /127.0.0.1 | 16:36:52,765 | 127.0.0.3 |           6460
                                                                  Sending message to /127.0.0.1 | 16:36:52,765 | 127.0.0.3 |           6874

                                                               Message received from /127.0.0.3 | 16:36:52,766 | 127.0.0.1 |          17470
                                                            Processing response from /127.0.0.3 | 16:36:52,766 | 127.0.0.1 |          17703
 Executing seq scan across 0 sstables for (max(6148914691236517204), min(-9223372036854775808)] | 16:36:52,767 | 127.0.0.1 |          18473

                                                                   Scanned 1 rows and matched 1 | 16:36:52,767 | 127.0.0.1 |          18681

                                                                               Request complete | 16:36:52,768 | 127.0.0.1 |          19293

 

This is substantially longer, since we're doing a sequential scan across the whole cluster:

  1. The coordinator sets up the replicas to query
  2. (blue) The first replica queries is the coordinator, which has 6 rows
  3. (yellow) 127.0.0.2 has 2 rows
  4. (green)> 127.0.0.3 has 1 row
  5. (blue)> A second scan (of a different replication range) on the coordinator finds 1 more row

(CASSANDRA-4858 is open to merge the two queries against the coordinator to a single one.)

Tracing an indexed query

Now I'd like to examine using tracing to diagnose performance problems. Consider a simple table of users:

 

CREATE TABLE users (
  id uuid PRIMARY KEY,
  firstname text,
  lastname text
);

CREATE INDEX ON users (firstname);

 

Now I've inserted a user named Bob Higginbigham, and 99,999 other users named Bob Smith.

 

 activity                                                     | timestamp    | source    | source_elapsed
--------------------------------------------------------------+--------------+-----------+----------------
                                           execute_cql3_query | 17:04:36,417 | 127.0.0.1 |              0
                                            Parsing statement | 17:04:36,417 | 127.0.0.1 |             57
                                           Peparing statement | 17:04:36,418 | 127.0.0.1 |            682
                                Determining replicas to query | 17:04:36,424 | 127.0.0.1 |           6623
                                [... removed: the coordinator scans itself... ]
                                Sending message to /127.0.0.2 | 17:04:46,433 | 127.0.0.1 |         105485

                             Message received from /127.0.0.1 | 17:04:46,439 | 127.0.0.2 |             29
Executing indexed scan for (max(0), max(3074457345618258602)] | 17:04:46,441 | 127.0.0.2 |           1844
Executing single-partition query on users.users_firstname_idx | 17:04:46,442 | 127.0.0.2 |           2692
                                 Acquiring sstable references | 17:04:46,442 | 127.0.0.2 |           2712
                                    Merging memtable contents | 17:04:46,442 | 127.0.0.2 |           2740
                             Scanned 32137 rows and matched 1 | 17:04:57,243 | 127.0.0.2 |         112842
                             Enqueuing response to /127.0.0.1 | 17:04:57,243 | 127.0.0.2 |         112861
                                Sending message to /127.0.0.1 | 17:04:57,243 | 127.0.0.2 |         112938

                                [... removed: the coordinator scands 127.0.0.3 ...]
                                             Request complete | 17:04:46,458 | 127.0.0.1 |         291378

 

I've edited this trace to focus on the replica that owns Bob Higginbigham.

Note how Cassandra has to scan all 30,000+ rows (all 100,000 including the other machines) to find the Bob we're looking for, since we only have an index on firstname. Moral: Cassandra isn't black magic, you still need to create appropriate indexes to get the performance you want.

Tracing a queue antipattern

Here's a more subtle one.

It is tempting to use Cassandra as a durable queue, e.g.

 

CREATE TABLE queues (
  id text,
  created_at timeuuid,
  value blob,
  PRIMARY KEY (id, created_at)
);

 

This relies on Cassandra's clustering within a partition (where the partition is the queue id) to order queue entries by creation time. Then grabbing the most recent queue entry is just SELECT FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1.

Here's what that looks like after creating and removing 100,000 entries:

 

 activity                                 | timestamp    | source    | source_elapsed
------------------------------------------+--------------+-----------+----------------
                       execute_cql3_query | 19:31:05,650 | 127.0.0.1 |              0
            Sending message to /127.0.0.3 | 19:31:05,651 | 127.0.0.1 |            541

        Messsage received from /127.0.0.1 | 19:31:05,651 | 127.0.0.3 |             39
         Executing single-partition query | 19:31:05,652 | 127.0.0.3 |            943
             Acquiring sstable references | 19:31:05,652 | 127.0.0.3 |            973
                Merging memtable contents | 19:31:05,652 | 127.0.0.3 |           1020
 Merging data from memtables and sstables | 19:31:05,652 | 127.0.0.3 |           1081
  Read 1 live cells and 100000 tombstoned | 19:31:05,686 | 127.0.0.3 |          35072
         Enqueuing response to /127.0.0.1 | 19:31:05,687 | 127.0.0.3 |          35220
            Sending message to /127.0.0.1 | 19:31:05,687 | 127.0.0.3 |          35314

        Messsage received from /127.0.0.3 | 19:31:05,687 | 127.0.0.1 |          36908
      Processing response from /127.0.0.3 | 19:31:05,688 | 127.0.0.1 |          37650
                         Request complete | 19:31:05,688 | 127.0.0.1 |          38047

 

Take a look in the middle here: "Read 1 live cells and 100000 tombstoned."

Because Cassandra uses a log-structured storage engine, deletes do not immediately remove all traces of a row. Instead, Cassandra writes a deletion marker called a tombstone that supresses the old data until it can be compacted away.

So what we see in this trace is Cassandra having to read past all the older, deleted entries, before it gets to one that is still alive -- a sign that you need to rethink your data model.

Availability

Tracing is part of Cassandra 1.2. Beta 2 is ready for testing; we expect the final release before the end of the year.

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.