What’s new in Cassandra 0.7: Secondary indexes
Deprecation warning
This post covers the obsolete Cassandra 0.7. Modern Cassandra manipulates indexes using CQL.
Overview
In Cassandra, indexes on column values are called "secondary indexes," to distinguish them from the index on the row key that all ColumnFamilies have. Secondary indexes allow querying by value and can be built in the background automatically without blocking reads or writes.
The best way to explain secondary indexes is by example. Let's start the Cassandra CLI and create a users ColumnFamily:
$ bin/cassandra-cli --host localhost
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown] create keyspace demo;
[default@unknown] use demo;
[default@demo] create column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS}];
Here we've defined two columns: full_name, which isn't indexed but is required to be a UTF8 String, and birth_date, which we are indexing.
For Cassandra 0.7.0 only the KEYS index type is supported; this is similar to a hash index. Support for bitmap indexes is being worked on for a future release.
Next we add some users:
[default@demo] set users[bsanderson][full_name] = 'Brandon Sanderson';
[default@demo] set users[bsanderson][birth_date] = 1975;
[default@demo] set users[prothfuss][full_name] = 'Patrick Rothfuss';
[default@demo] set users[prothfuss][birth_date] = 1973;
[default@demo] set users[htayler][full_name] = 'Howard Tayler';
[default@demo] set users[htayler][birth_date] = 1968;
Now we can ask Cassandra for users born in a given year:
[default@demo] get users where birth_date = 1973;
-------------------
RowKey: prothfuss
=> (column=birth_date, value=1973, timestamp=1291333944389000)
=> (column=full_name, value=Patrick Rothfuss, timestamp=1291333940538000)
Adding an index
Now, suppose we now want to find all the users in a given state. In older versions of Cassandra, we'd need to create a ColumnFamily named (say) users_by_state, whose row keys were the state names and whose columns were the users in that state -- sort of a materialized view in each row.
This works fine, but it has some drawbacks: it's a fair amount of boilerplate to maintain these in your application, and when you add query types, back-populating the materialized views for extant data is a chore (although Hadoop support helps).
Secondary indexes automate this. Let's add some state data:
[default@demo] set users[bsanderson][state] = 'UT';
[default@demo] set users[prothfuss][state] = 'WI';
[default@demo] set users[htayler][state] = 'UT';
Note that even though state is not indexed yet, we can include the new state data in a query as long as another column in the query is indexed:
[default@demo] get users where state = 'UT';
No indexed columns present in index clause with operator EQ
[default@demo] get users where state = 'UT' and birth_date > 1970;
No indexed columns present in index clause with operator EQ
[default@demo]get users where birth_date = 1968 and state = 'UT';
-------------------
RowKey: htayler
=> (column=birth_date, value=1968, timestamp=1291334765649000)
=> (column=full_name, value=Howard Tayler, timestamp=1291334749160000)
=> (column=state, value=5554, timestamp=1291334890708000)
One consequence of the KEYS index type being more like a hash index than a btree is shown here: even though birth_date is indexed, Cassandra couldn't perform the range query "> 1970" against it.
We also see above that the CLI doesn't know how to interpret the value of the state column since we haven't told it what kind of data is in it yet. We'll add that at the same time as the new index; then we can query the state column alone:
[default@demo] update column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS},
... {column_name: state, validation_class: UTF8Type, index_type: KEYS}];
Now we can query against the state column alone or with other columns:
[default@demo] get users where state = 'UT';
-------------------
RowKey: bsanderson
=> (column=birth_date, value=1975, timestamp=1291333936242000)
=> (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000)
=> (column=state, value=UT, timestamp=1291334909266000)
-------------------
RowKey: htayler
=> (column=birth_date, value=1968, timestamp=1291334765649000)
=> (column=full_name, value=Howard Tayler, timestamp=1291334749160000)
=> (column=state, value=UT, timestamp=1291334890708000)
[default@demo] get users where state = 'UT' and birth_date > 1970;
-------------------
RowKey: bsanderson
=> (column=birth_date, value=1975, timestamp=1291333936242000)
=> (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000)
=> (column=state, value=UT, timestamp=1291334909266000)
We can perform the range query now that the state column is also indexed, so Cassandra can use the state predicate as the primary and filter on the other with a nested loop.
Programatically
Different Cassandra clients may use different method names but the idea is the same. This last query in the pycassa Python client looks like this:
state_expr = pycassa.create_index_expression('state', 'UT')
birth_expr = pycassa.create_index_expression('birth_date', 1970, op=IndexOperator.GT)
clause = pycassa.create_index_clause([state_expr, bday_expr])
result = users.get_indexed_slices(clause):
In the Hector Java client:
StringSerializer ss = StringSerializer.get();
IndexedSlicesQuery<String, String, String> indexedSlicesQuery = HFactory.createIndexedSlicesQuery(keyspace, ss, ss, ss);
indexedSlicesQuery.setColumnNames("full_name", "birth_date", "state");
indexedSlicesQuery.addGtExpression("birth_date", 1970L);
indexedSlicesQuery.addEqualsExpression("state", "UT");
indexedSlicesQuery.setColumnFamily("users");
indexedSlicesQuery.setStartKey("");
QueryResult<OrderedRows<String, String, String>> result = indexedSlicesQuery.execute();
See the pycassa documentation and hector documentation for more details.