Improving Secondary Index Write Performance in 1.2
Introduced in version 0.7, secondary indexes provide a means to access data in Cassandra using attributes other than the row key. The two index implementations currently in the Cassandra codebase use an auxiliary column family to model an inverted index of values from a "primary" column family. The SecondaryIndex interface is an extension point, making it possible to plug in alternative index implementations.
While secondary indexes can add a lot of flexibility to the way data is modelled and accessed, they do add complexity on the server side as the indexes need to be kept in sync with the primary data. Until recently, this has led to some significant trade offs in write throughput and IO utilisation as we always had to perform a read before the write in order to update any relevant secondary indexes. In Cassandra 1.2, this area has been substantially reworked to remove the need for read-before-write. New index entries are now written at the same time as the primary data is updated and old entries removed lazily at query time. Overall, this has lead to some decent performance improvements.
Secondary Index Write Path Pre-1.2
To ensure consistency between primary data and secondary indexes, every time a RowMutation was received it was necessary to first determine which, if any, of the columns being mutated were configured with secondary indexes. This represented a fairly cheap operation as the information was already held in memory by the SecondaryIndexManager, of which there exists one per column family. If none of the columns affected by the mutation were marked as indexed, then no special processing was required and the mutation was simply applied by updating the relevant memtable. However, if any secondary indexes did need to be updated some additional work was required.
The main issue here was that in order to update the indexes, the previously indexed values of all the columns being mutated had to be obtained. This required a read of the index column family which potentially involved disk access to fetch data from SSTables. This additional read operation represented a real hit on write performance, going against the Cassandra principle of optimising for minimal write latency and maximal write throughput. Having identified the subset of updated columns which were indexed and having read the current values of those columns, the mutation was then applied to the column family containing the primary data. Finally, the secondary indexes were updated, by issuing deletes for the original indexed values, and inserts for the new ones. This guarded against the undesirable scenario where a row may be incorrectly retrieved during a query involving secondary indexes on the basis of stale values in the index.
Improvements in 1.2
So how to go about removing the read-before-write overhead? The answer has two parts: first, we pushed the updates to the secondary indexes down the stack so that they happen as close as possible (in the Cassandra code) to the primary data updates; second, we implemented a kind of read-repair for indexes. The JIRA for these changes is CASSANDRA-2897.
During an update, a closure is created which wraps the row key and index manager, allowing us to shuttle the index update logic to where it's most relevant. This closure is passed, along with the mutation, down into the memtable where the logic for performing atomic operations in memory is. For each mutated column we know both the new value and any old value that may be resident in the memtable. So without having to perform a separate read, which could touch disk, we update the relevant index at the same time.
Whilst this goes part of the way to ensuring that any secondary index will always accurately reflect the primary column family, some further checks were added to ensure consistency. Where secondary indexes are queried and matching rows from the primary column family retrieved, a check was added to ensure that the value obtained from the index still matched the one in the primary data. In the case where the two values did not match, perhaps because of some failure to update the index value, the row is dropped from the query resultset and the index value deleted (so that the same erroneous result isn't returned for subsequent queries).
In a simple single-node benchmark, using a highly contended column family and a dataset size which exceeds RAM, we see an ~11% improvement in write throughput. So these changes have allowed us to remove a nasty bottleneck in Cassandra's write path and help make using secondary indexes a less costly exercise all round.
Producing an initial patch for CASSANDRA-2897 was part of the technical interview that I sat prior to joining DataStax. If this sounds like the sort of problem you enjoy working on, drop us a line or check out our open positions . You can also read more about our distributed setup and hiring process.