Request tracing in Cassandra 1.2
Cassandra long ago moved beyond its original design of limiting reads to primary key lookups.
This additional power does bring with it some additional complexity. To make it easier to understand what Cassandra is doing under the hood, we've added request tracing to Cassandra 1.2.
Tracing an insert
As a simple example, consider this table:
CREATE TABLE test (
    a int PRIMARY KEY,
    b text
);
INSERT INTO test (a, b) VALUES (1, 'example');
INSERT INTO test (a, b) VALUES (2, 'example');
INSERT INTO test (a, b) VALUES (3, 'example');
INSERT INTO test (a, b) VALUES (4, 'example');
INSERT INTO test (a, b) VALUES (5, 'example');
INSERT INTO test (a, b) VALUES (6, 'example');
INSERT INTO test (a, b) VALUES (7, 'example');
INSERT INTO test (a, b) VALUES (8, 'example');
INSERT INTO test (a, b) VALUES (9, 'example');
INSERT INTO test (a, b) VALUES (10, 'example');
Now we'll enable tracing and insert a row. (These numbers are from a ccm cluster with all three nodes on my MacBook Air, on a cold JVM.)
cqlsh> tracing on;
Now tracing requests.
cqlsh:foo> INSERT INTO test (a, b) VALUES (1, 'example');
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9
activity | timestamp | source | source_elapsed
-------------------------------------+--------------+-----------+----------------
execute_cql3_query | 00:02:37,015 | 127.0.0.1 | 0
Parsing statement | 00:02:37,015 | 127.0.0.1 | 81
Preparing statement | 00:02:37,015 | 127.0.0.1 | 273
Determining replicas for mutation | 00:02:37,015 | 127.0.0.1 | 540
Sending message to /127.0.0.2 | 00:02:37,015 | 127.0.0.1 | 779
Message received from /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 63
Applying mutation | 00:02:37,016 | 127.0.0.2 | 220
Acquiring switchLock | 00:02:37,016 | 127.0.0.2 | 250
Appending to commitlog | 00:02:37,016 | 127.0.0.2 | 277
Adding to memtable | 00:02:37,016 | 127.0.0.2 | 378
Enqueuing response to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 710
Sending message to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 888
Message received from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2334
Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2550
Request complete | 00:02:37,017 | 127.0.0.1 | 2581
You can see that there are three distinct stages to a simple insert like this:
- The coordinator figures out which node(s) this row should be replicated to
- The replica (127.0.0.2) appends the row to the commitlog, then adds it to the memtable
- The coordinator receives a confirmation from the replica and tells the client that the request was successful
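Traces aren't just printed by cqlsh; they're also saved to the system_traces keyspace, so you can pull a session back up later or trace requests made by other clients. As a rough sketch (the column names here are from memory, so treat the details as an assumption), you could fetch the insert trace above by the session id cqlsh printed:

cqlsh> SELECT * FROM system_traces.sessions WHERE session_id = 4ad36250-1eb4-11e2-0000-fe8ebeead9f9;
cqlsh> SELECT activity, source, source_elapsed FROM system_traces.events WHERE session_id = 4ad36250-1eb4-11e2-0000-fe8ebeead9f9;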
Tracing a sequential scan
Let's look at a more complicated example. Here I've pre-inserted ten rows like the above one.
cqlsh:foo> SELECT * FROM test;
activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------+--------------+-----------+----------------
execute_cql3_query | 16:36:52,749 | 127.0.0.1 | 0
Parsing statement | 16:36:52,749 | 127.0.0.1 | 56
Preparing statement | 16:36:52,749 | 127.0.0.1 | 259
Determining replicas to query | 16:36:52,749 | 127.0.0.1 | 540
Executing seq scan across 0 sstables for [min(-9223372036854775808), max(0)] | 16:36:52,750 | 127.0.0.1 | 1522
Scanned 6 rows and matched 6 | 16:36:52,751 | 127.0.0.1 | 2062
Sending message to /127.0.0.2 | 16:36:52,751 | 127.0.0.1 | 2292
Message received from /127.0.0.1 | 16:36:52,751 | 127.0.0.2 | 37
Executing seq scan across 0 sstables for (max(0), max(3074457345618258602)] | 16:36:52,752 | 127.0.0.2 | 663
Scanned 2 rows and matched 2 | 16:36:52,755 | 127.0.0.2 | 3952
Enqueuing response to /127.0.0.1 | 16:36:52,755 | 127.0.0.2 | 4001
Sending message to /127.0.0.1 | 16:36:52,756 | 127.0.0.2 | 4366
Message received from /127.0.0.2 | 16:36:52,757 | 127.0.0.1 | 8019
Processing response from /127.0.0.2 | 16:36:52,757 | 127.0.0.1 | 8544
Sending message to /127.0.0.3 | 16:36:52,758 | 127.0.0.1 | 9117
Message received from /127.0.0.1 | 16:36:52,759 | 127.0.0.3 | 36
Executing seq scan across 0 sstables for (max(3074457345618258602), max(6148914691236517204)] | 16:36:52,764 | 127.0.0.3 | 4891
Scanned 1 rows and matched 1 | 16:36:52,765 | 127.0.0.3 | 6418
Enqueuing response to /127.0.0.1 | 16:36:52,765 | 127.0.0.3 | 6460
Sending message to /127.0.0.1 | 16:36:52,765 | 127.0.0.3 | 6874
Message received from /127.0.0.3 | 16:36:52,766 | 127.0.0.1 | 17470
Processing response from /127.0.0.3 | 16:36:52,766 | 127.0.0.1 | 17703
Executing seq scan across 0 sstables for (max(6148914691236517204), min(-9223372036854775808)] | 16:36:52,767 | 127.0.0.1 | 18473
Scanned 1 rows and matched 1 | 16:36:52,767 | 127.0.0.1 | 18681
Request complete | 16:36:52,768 | 127.0.0.1 | 19293
This is substantially longer, since we're doing a sequential scan across the whole cluster:
- The coordinator sets up the replicas to query
- The first replica queried is the coordinator itself (127.0.0.1), which has 6 rows
- 127.0.0.2 has 2 rows
- 127.0.0.3 has 1 row
- A second scan (of a different replication range) on the coordinator finds 1 more row
(CASSANDRA-4858 is open to merge the two queries against the coordinator into a single one.)
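As an aside, those min(...)/max(...) bounds are token ranges. If you want to walk a table range-by-range yourself rather than trigger a cluster-wide scan each time, CQL's token() function lets you do so; a minimal sketch (the key value 5 is just illustrative):

cqlsh:foo> SELECT * FROM test WHERE token(a) > token(5) LIMIT 100;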
Tracing an indexed query
Now I'd like to examine using tracing to diagnose performance problems. Consider a simple table of users:
CREATE TABLE users (
    id uuid PRIMARY KEY,
    firstname text,
    lastname text
);
CREATE INDEX ON users (firstname);
Now I've inserted a user named Bob Higginbigham, and 99,999 other users named Bob Smith.
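The traced statement itself isn't reproduced below; it was a lookup of Bob Higginbigham by name that goes through the firstname index. A hypothetical reconstruction (the exact query may have differed, and ALLOW FILTERING may or may not be required here) looks something like:

-- reconstructed for illustration, not the literal traced statement
SELECT * FROM users WHERE firstname = 'Bob' AND lastname = 'Higginbigham' ALLOW FILTERING;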
activity | timestamp | source | source_elapsed
--------------------------------------------------------------+--------------+-----------+----------------
execute_cql3_query | 17:04:36,417 | 127.0.0.1 | 0
Parsing statement | 17:04:36,417 | 127.0.0.1 | 57
Preparing statement | 17:04:36,418 | 127.0.0.1 | 682
Determining replicas to query | 17:04:36,424 | 127.0.0.1 | 6623
[... removed: the coordinator scans itself ...]
Sending message to /127.0.0.2 | 17:04:46,433 | 127.0.0.1 | 105485
Message received from /127.0.0.1 | 17:04:46,439 | 127.0.0.2 | 29
Executing indexed scan for (max(0), max(3074457345618258602)] | 17:04:46,441 | 127.0.0.2 | 1844
Executing single-partition query on users.users_firstname_idx | 17:04:46,442 | 127.0.0.2 | 2692
Acquiring sstable references | 17:04:46,442 | 127.0.0.2 | 2712
Merging memtable contents | 17:04:46,442 | 127.0.0.2 | 2740
Scanned 32137 rows and matched 1 | 17:04:57,243 | 127.0.0.2 | 112842
Enqueuing response to /127.0.0.1 | 17:04:57,243 | 127.0.0.2 | 112861
Sending message to /127.0.0.1 | 17:04:57,243 | 127.0.0.2 | 112938
[... removed: the coordinator scans 127.0.0.3 ...]
Request complete | 17:04:46,458 | 127.0.0.1 | 291378
I've edited this trace to focus on the replica that owns Bob Higginbigham.
Note how Cassandra has to scan over 32,000 rows on this replica (all 100,000 across the cluster) to find the Bob we're looking for, since we only have an index on firstname. Moral: Cassandra isn't black magic; you still need to create appropriate indexes to get the performance you want.
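For this access pattern, the straightforward fix is an index on the column we actually filter by. With an index on lastname, Cassandra can go straight to the single matching row instead of scanning every Bob; a minimal sketch:

CREATE INDEX ON users (lastname);
SELECT * FROM users WHERE lastname = 'Higginbigham';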
Tracing a queue antipattern
Here's a more subtle one.
It is tempting to use Cassandra as a durable queue, e.g.
CREATE TABLE queues (
    id text,
    created_at timeuuid,
    value blob,
    PRIMARY KEY (id, created_at)
);
This relies on Cassandra's clustering within a partition (where the partition is the queue id) to order queue entries by creation time. Then grabbing the oldest entry in the queue is just SELECT * FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1.
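The trouble is the dequeue side: consuming an entry means deleting it, and every delete leaves a tombstone at the head of the partition. A rough sketch of the cycle (this assumes 1.2's now() timeuuid function, and the value in the DELETE just stands in for whatever created_at the SELECT returned):

-- enqueue
INSERT INTO queues (id, created_at, value) VALUES ('myqueue', now(), 0x00);
-- dequeue: read the oldest entry, then delete it by its clustering key
SELECT * FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1;
DELETE FROM queues WHERE id = 'myqueue' AND created_at = 50554d6e-29bb-11e5-b345-feff819cdc9f;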
Here's what that looks like after creating and removing 100,000 entries:
activity | timestamp | source | source_elapsed
------------------------------------------+--------------+-----------+----------------
execute_cql3_query | 19:31:05,650 | 127.0.0.1 | 0
Sending message to /127.0.0.3 | 19:31:05,651 | 127.0.0.1 | 541
Message received from /127.0.0.1 | 19:31:05,651 | 127.0.0.3 | 39
Executing single-partition query | 19:31:05,652 | 127.0.0.3 | 943
Acquiring sstable references | 19:31:05,652 | 127.0.0.3 | 973
Merging memtable contents | 19:31:05,652 | 127.0.0.3 | 1020
Merging data from memtables and sstables | 19:31:05,652 | 127.0.0.3 | 1081
Read 1 live cells and 100000 tombstoned | 19:31:05,686 | 127.0.0.3 | 35072
Enqueuing response to /127.0.0.1 | 19:31:05,687 | 127.0.0.3 | 35220
Sending message to /127.0.0.1 | 19:31:05,687 | 127.0.0.3 | 35314
Message received from /127.0.0.3 | 19:31:05,687 | 127.0.0.1 | 36908
Processing response from /127.0.0.3 | 19:31:05,688 | 127.0.0.1 | 37650
Request complete | 19:31:05,688 | 127.0.0.1 | 38047
Take a look in the middle here: "Read 1 live cells and 100000 tombstoned."
Because Cassandra uses a log-structured storage engine, deletes do not immediately remove all traces of a row. Instead, Cassandra writes a deletion marker called a tombstone that suppresses the old data until it can be compacted away.
So what we see in this trace is Cassandra having to read past all the older, deleted entries, before it gets to one that is still alive -- a sign that you need to rethink your data model.
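One possible remodel (offered only as a sketch, and assuming 1.2's composite partition key syntax) is to bucket the queue by a coarse time window, so consumers only ever read the current bucket and older, tombstone-heavy buckets are never touched again; the hypothetical table might look like:

CREATE TABLE bucketed_queues (
    id text,
    bucket text,            -- e.g. an hourly bucket such as '2012-10-25-19'
    created_at timeuuid,
    value blob,
    PRIMARY KEY ((id, bucket), created_at)
);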
Availability
Tracing is part of Cassandra 1.2. Beta 2 is ready for testing; we expect the final release before the end of the year.