What’s New in CQL 3.0

Cassandra 1.1 brings some important new developments in the CQL modeling and query language. The changes are deep enough, in fact, that they require a whole new version of the CQL language specification. Most of them relate to the big push underway to encourage the use of composite keys for simpler denormalization. See the earlier post on schema in Cassandra 1.1 for a quick explanation of how composites can help avoid manual denormalization when you need to look up your data along multiple dimensions.

The new version of CQL is numbered 3.0. As a reminder, the flavor of CQL we have come to know and love in previous versions of Cassandra is CQL 2.0. Don’t worry: CQL 2 is still supported in Cassandra 1.1 for backwards compatibility, but we recommend that new applications be developed with CQL 3.

To use CQL 3, tell your CQL driver that you intend to use it. How you do so differs by driver. For example, with the Python driver, you specify the desired CQL version in the connect() call:

  import cql
  connection = cql.connect('localhost', 9160, cql_version='3.0.0')

The biggest change in CQL 3 is that the CREATE TABLE syntax now allows multi-column primary keys. As explained in the schema post, this is translated under the covers to wide Cassandra rows using composite column names, so it is increasingly important to understand that storage-engine rows are not the same thing as CQL rows. If you create and use a CQL 3 table with a compound primary key and then inspect the corresponding column family through the Thrift interface, you will see the difference in layout.

Now, to be more specific, the CQL 3 CREATE TABLE syntax still allows table creation in much the same way as before, for simple keys:

   CREATE TABLE crew (
       name text PRIMARY KEY,
       bunk int,
       alliance_trouble boolean
   );

But now you also have the option of declaring a primary key made up of several columns, using a syntax borrowed from SQL:

   CREATE TABLE altercations (
       instigator text,
       started_at timestamp,
       ships_destroyed int,
       energy_used float,
       alliance_involvement boolean,
       PRIMARY KEY (instigator, started_at)
   );

A CQL 3 table’s primary key can have any number (1 or more) of component columns, but there must be at least one column which is not part of the primary key.
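
To make the difference between CQL rows and storage-engine rows a bit more concrete, here is a rough Python sketch of the translation for the altercations table. It is only a toy model, not Cassandra code: the storage dict and the insert_cql_row helper are made up for illustration, and real cells also carry timestamps and a binary encoding that are left out here.

```python
from collections import defaultdict

# Hypothetical, simplified model of the storage engine: one wide row per
# value of the first key column, whose cells are named by a composite
# (tuple) of the remaining key components plus the CQL column name.
storage = defaultdict(dict)

def insert_cql_row(primary_key, value_columns, row):
    """Flatten one CQL row into cells of a wide storage row.

    primary_key:   key column names in order, e.g. ('instigator', 'started_at')
    value_columns: names of the columns that are not part of the key
    row:           dict holding the CQL row
    """
    row_key = row[primary_key[0]]
    prefix = tuple(row[col] for col in primary_key[1:])
    for col in value_columns:
        if col in row:
            storage[row_key][prefix + (col,)] = row[col]

insert_cql_row(
    ('instigator', 'started_at'),
    ('ships_destroyed', 'energy_used', 'alliance_involvement'),
    {'instigator': 'Jayne Cobb', 'started_at': '7943-07-23',
     'ships_destroyed': 2, 'energy_used': 4.6, 'alliance_involvement': False},
)

# One storage row keyed by 'Jayne Cobb', holding three composite-named cells.
for name in sorted(storage['Jayne Cobb']):
    print(name, storage['Jayne Cobb'][name])
```

In this model, a second altercation started by Jayne Cobb would add three more cells to the same storage row rather than creating a new one, which is why these storage rows can grow wide.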

Inserting data into these composite-key tables is, as you might expect, the same as before:

   INSERT INTO crew (name, bunk, alliance_trouble) VALUES ('River Tam', 13, 'true');

   INSERT INTO altercations (instigator, started_at, ships_destroyed,
                             energy_used, alliance_involvement)
                     VALUES ('Jayne Cobb', '7943-07-23', 2, 4.6, 'false');

   UPDATE crew SET bunk=9 WHERE name='Book';

   UPDATE altercations SET energy_used = 2.3
          WHERE instigator='Jubal Early' AND started_at='7943-12-19';

Selecting data from these tables is likewise nearly the same, but there are a few extra requirements to help avoid doing things inefficiently. In CQL 2, WHERE clauses had to operate on the table key alone (or on an indexed column alone). With composite keys, there is now a need to select on several key columns at the same time:

   SELECT * FROM altercations WHERE instigator='Jayne Cobb' AND started_at > '7943-08-24';

   SELECT energy_used, ships_destroyed FROM altercations
          WHERE instigator='Malcolm Reynolds' ORDER BY started_at DESC;

Notice that WHERE clauses can now include greater-than/less-than comparisons on columns other than the first. As long as all preceding key component columns have been pinned down with strict = comparisons, the last key component column given can use any sort of comparison.
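
If it helps, the rule can be written down as a small Python check. This is a sketch of the rule as stated in this post only; check_key_restrictions is a hypothetical helper, and the real query validator in Cassandra certainly applies more rules than this.

```python
def check_key_restrictions(primary_key, restrictions):
    """Return True if the key restrictions follow the rule described above.

    primary_key:  key column names in order
    restrictions: list of (column, operator) pairs from the WHERE clause;
                  a column may appear twice, e.g. for both ends of a range
    """
    ops = {}
    for col, op in restrictions:
        ops.setdefault(col, []).append(op)
    restricted = [col for col in primary_key if col in ops]
    # The restricted key columns must form a prefix of the key (no gaps).
    if restricted != list(primary_key[:len(restricted)]):
        return False
    # Every restricted key column except the last must use a single '='.
    return all(ops[col] == ['='] for col in restricted[:-1])

key = ('instigator', 'started_at')
print(check_key_restrictions(key, [('instigator', '='), ('started_at', '>')]))
print(check_key_restrictions(key, [('started_at', '>')]))
```

The first call mirrors the Jayne Cobb query above and passes; the second skips the first key column and is rejected.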

Composite keys mean it now makes sense for CQL to sport the ORDER BY syntax in SELECT queries as well, but it’s still not nearly as flexible as you might be used to from doing ad-hoc queries in SQL. An ORDER BY clause can only name a single column, and that column has to be the second column in a composite PRIMARY KEY. This holds even for tables with more than two component columns in the primary key. Ordering can be ascending or descending, specified with the ASC or DESC keywords; the default is ascending.
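
As a quick illustration of those ORDER BY restrictions, here is a small Python sketch. The order_by helper and the sample rows are made up for this example; it models the rules described here, not how Cassandra actually reads rows back.

```python
def order_by(rows, primary_key, column, direction='ASC'):
    """Sort the CQL rows of one partition, following the ORDER BY rules above."""
    if len(primary_key) < 2 or column != primary_key[1]:
        raise ValueError('ORDER BY is only allowed on the second key column')
    if direction not in ('ASC', 'DESC'):
        raise ValueError('direction must be ASC or DESC')
    return sorted(rows, key=lambda r: r[column], reverse=(direction == 'DESC'))

# Made-up sample rows for a single instigator (not data from a real cluster).
rows = [
    {'started_at': '7943-05-11', 'ships_destroyed': 0},
    {'started_at': '7943-01-30', 'ships_destroyed': 1},
]
primary_key = ('instigator', 'started_at')

newest_first = order_by(rows, primary_key, 'started_at', 'DESC')
print([r['started_at'] for r in newest_first])
```

Asking this helper to order by ships_destroyed raises a ValueError, which is the toy equivalent of the query being refused.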

An important change that comes with this new emphasis on composite-key-based denormalization is that direct access to the storage engine via column ranges is no longer supported in CQL 3. Instead, CQL 3 lets you perform those queries with normal predicates in the WHERE clause, operating on logical columns. You just need to represent your wide-row data in a way that fits a CQL 3 table (this should be possible for anything you have already modeled for Cassandra). As a result, there is no column range operator “..”, and there are no FIRST or REVERSED clauses for column selection in SELECT queries. They’re not necessary anymore.

Here’s an example of this sort of adjustment to a data model. Suppose Captain Reynolds keeps track of every ship registered by his sensors as he flies through space. Using CQL 2, he might have had a column family like this:

   CREATE TABLE seen_ships (day text PRIMARY KEY)
                WITH comparator=timestamp AND default_validation=text;

   INSERT INTO seen_ships (day, '7943-02-02 00:08:22')
                   VALUES ('199-A/4', 'Sunrise Avenger');

   SELECT FIRST 12 '7943-02-03'..'7943-02-28' FROM seen_ships WHERE day='199-A/4';

So each day-code is a Cassandra row, and events are added in with timestamps. Rows could become extremely wide, depending on how bad traffic is in space that day. One simple way to adjust this to CQL 3 would be something like:

   CREATE TABLE seen_ships (
       day text,
       time_seen timestamp,
       shipname text,
       PRIMARY KEY (day, time_seen)
   );

   SELECT * FROM seen_ships WHERE day='199-A/4'
            AND time_seen > '7943-02-03' AND time_seen < '7943-02-28'
            LIMIT 12;

Notice there shouldn’t be a need to set a custom comparator or default_validation class in CQL 3 either.
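
To see why the WHERE predicates plus LIMIT cover the same ground as the old column slice, here is a rough Python sketch of the CQL 3 query running against one sorted partition. The select_between helper and the "Example Ship" rows are invented for illustration, and timestamps are kept as strings, which only sort correctly because they share one fixed format.

```python
from bisect import bisect_left, bisect_right

# Hypothetical contents of the '199-A/4' partition: (time_seen, shipname)
# pairs kept sorted by time_seen, the way cells are sorted in a wide row.
partition = [
    ('7943-02-02 00:08:22', 'Sunrise Avenger'),
    ('7943-02-05 14:30:00', 'Example Ship A'),
    ('7943-02-17 09:12:45', 'Example Ship B'),
    ('7943-03-01 23:59:59', 'Example Ship C'),
]

def select_between(partition, start, end, limit):
    """Rows with start < time_seen < end, in key order, at most `limit`."""
    times = [t for t, _ in partition]
    lo = bisect_right(times, start)  # index of first row strictly after start
    hi = bisect_left(times, end)     # index of first row at or after end
    return partition[lo:hi][:limit]

for time_seen, shipname in select_between(partition, '7943-02-03', '7943-02-28', 12):
    print(time_seen, shipname)
```

Because the rows in a partition are already sorted by time_seen, the range and the limit amount to taking one contiguous slice, which is what FIRST and “..” expressed in CQL 2.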

If you’re paying close attention and are also quite space-conscious, you may have noticed that while this alternative may work for Captain Reynolds, it’s going to cost more in space than it would have under the CQL 2 approach, since for every piece of data stored, the column name needs to be stored along with it. This is the reason for the WITH COMPACT STORAGE directive discussed in the schema blog post mentioned earlier. With this addition:

   CREATE TABLE seen_ships (
       day text,
       time_seen timestamp,
       shipname text,
       PRIMARY KEY (day, time_seen)
   ) WITH COMPACT STORAGE;

Mal’s data will be stored essentially the same as it was under CQL 2. One downside is that it will be trickier to add columns to, or remove columns from, the model with this approach (in fact, support for adding columns to COMPACT STORAGE tables is not yet ready, and it may never be worth adding support for deleting them).
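
Here is a rough Python sketch of where the space saving comes from. It is a toy model under loud assumptions: cells_regular, cells_compact and name_bytes are hypothetical helpers, cell names are plain tuples of strings, and the byte count ignores everything the real on-disk encoding adds.

```python
def cells_regular(row):
    # Without COMPACT STORAGE: the cell name is a composite that repeats
    # the CQL column name ('shipname') for every piece of data stored.
    return {(row['time_seen'], 'shipname'): row['shipname']}

def cells_compact(row):
    # With COMPACT STORAGE: the cell is named by time_seen alone, much like
    # the timestamp-named columns in the CQL 2 version of the table.
    return {(row['time_seen'],): row['shipname']}

def name_bytes(cells):
    """Total bytes spent on cell names in this simplified model."""
    return sum(len(part.encode('utf-8')) for name in cells for part in name)

row = {'day': '199-A/4', 'time_seen': '7943-02-02 00:08:22',
       'shipname': 'Sunrise Avenger'}

saved = name_bytes(cells_regular(row)) - name_bytes(cells_compact(row))
print('bytes of cell name saved per sighting in this model:', saved)
```

The saving per cell is small, but in this model it is paid on every sighting, so it adds up on very wide rows.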

That pretty much covers the major additions to CQL. There are some smaller changes as well, but not very many. For example, column names which don’t parse as valid CQL identifiers now need to be enclosed in double quotes (in CQL 2, they had to be enclosed in single quotes, just like string literals). But we’re probably past the point here where it’s interesting enough for a blog post.
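
For what it’s worth, client code that builds CQL 3 statements could handle the quoting change with something like the Python sketch below. It rests on assumptions not spelled out in this post: that a plain identifier is a letter followed by letters, digits, or underscores, and that reserved words and case folding can be ignored. quote_column_name is a hypothetical helper, not part of any driver.

```python
import re

# Assumed shape of a plain CQL identifier (see the caveats above).
_IDENTIFIER = re.compile(r'[A-Za-z][A-Za-z0-9_]*\Z')

def quote_column_name(name):
    """Return `name` as it could be written in a CQL 3 statement."""
    if _IDENTIFIER.match(name):
        return name
    if '"' in name:
        # Escaping rules for embedded double quotes are not covered here.
        raise ValueError('column names containing a double quote are not handled')
    return '"' + name + '"'

print(quote_column_name('shipname'))
print(quote_column_name('7943-02-02 00:08:22'))
```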

CQL 3 may not yet be as well supported as CQL 2 by Cassandra-related tools and drivers (I’m looking at you, cqlsh!) but at least minimal support should exist in all major tools and drivers, and ubiquitous first-class support is a near-term goal.

EDIT 2012-04-21 15:07 UTC – clarified that no functionality has been removed in CQL 3.
EDIT 2012-04-21 15:11 UTC – changed claim about possibility of adding columns to COMPACT STORAGE- apparently support is in the works
EDIT 2012-04-23 15:58 UTC – fixed link to previous schema blog post
EDIT 2012-04-30 21:25 UTC – fixed argument to python cql.connect() function for getting CQL 3.
EDIT 2012-08-19 04:31 UTC – quoted “true” and “false” literals to be valid modern CQL 3.