A deep look at the CQL WHERE clause
For hands-on exercises that allow you to try out the CQL statements in this blog, check out the Queries lesson from the Cassandra Fundamentals learning series.
While they share similar syntaxes, there are lots of differences between CQL and SQL. The reasons for these differences come mainly from the fact that Cassandra is dealing with distributed data and aims to prevent inefficient queries. One of the places where CQL differs a lot from SQL is the WHERE clause. The goal of this post is to describe what is supported by the CQL WHERE clause and the reasons why it differs from normal SQL.
Primary key columns
In Cassandra, two types of columns have a special role: the partition key columns and the clustering columns. Together, they will define your row primary key. The partition key columns are the first part of primary key and their role is to spread data evenly around the cluster. Rows will be spread around the cluster based on the hash of the partition keys. The clustering key columns are used to cluster the data of a partition, allowing a very efficient retrival of rows. Due to the differences in the role that they are playing, partition key, clustering and normal columns support different sets of restrictions within the WHERE clause. Futhermore, those sets of restrictions differ depending of the type of query: SELECT, UPDATE or DELETE.
WHERE clause restrictions for SELECT statements
Partition keys restrictions
The partition key columns support only two operators: = and IN
IN restriction
Prior to 2.2, the IN restrictions could only be applied to the last column of the partition key. So, for example, if your table was:
CREATE TABLE numberOfRequests ( cluster text, date text, time text, numberOfRequests int, PRIMARY KEY ((cluster, date), time) )
In 2.1, you could only use an IN operator on the date column. In 2.2 you can use the IN operator on any partition key column. By consequence, a query like:
SELECT * FROM numberOfRequests WHERE cluster IN ('cluster1', 'cluster2') AND date = '2015-05-06' AND time >= '12:00' AND time <= '14:00';
is valid since 2.2 but is invalid in the prior releases. This change makes CQL more consistent but you should nevertheless be careful with IN restrictions on partition key columns. The nice post of Ryan Svihla will give you a clear explanation on why you should try to avoid them. Another change, introduced with 2.2, is that the results are not returned anymore in the order in which the partition key where specified in the IN clause. From 2.2 onward, the results are returned in the natural order of the column type and duplicates are ignored.
Unrestricted partition key columns
Cassandra will require that you either restrict all the partition key columns, or none of them unless the query can use a secondary index. This means that a query like:
SELECT * FROM numberOfRequests WHERE cluster='cluster1' AND time ='12:00';
will be rejected as the date column is not restricted. The reason why is that Cassandra needs all the partition key columns to be able to compute the hash that will allow it to locate the nodes containing the partition. If no restrictions are specified on the partition keys but some are specified on the clustering keys, Cassandra will require ALLOW FILTERING to be added to the query. For more information on ALLOW FILTERING you should look at ALLOW FILTERING explained.
>, >=, <= and < operators
Cassandra distributes the partition accross the nodes using the selected partitioner. As only the ByteOrderedPartitioner keeps an ordered distribution of data Cassandra does not support >, >=, <= and < operator directly on the partition key. Instead, it allows you to use the >, >=, <= and < operator on the partition key through the use of the token function.
SELECT * FROM numberOfRequests WHERE token(cluster, date) > token('cluster1', '2015-06-03') AND token(cluster, date) <= token('cluster1', '2015-06-05') AND time = '12:00';
If you use a ByteOrderedPartitioner, you will then be able to perform some range queries over multiple partitions. You should nevertheless be careful. Using a ByteOrderedPartitioner is not recommended as it can result in unbalanced clusters.
Clustering column restrictions
Clustering columns support the =, IN, >, >=, <=, <, CONTAINS and CONTAINS KEY operators in single-column restrictions and the =, IN, >, >=, <= and < operators in multi-column restrictions.
Unrestricted clustering columns
The role of clustering columns is to cluster data within a partition. If you have the following table:
CREATE TABLE numberOfRequests ( cluster text, date text, datacenter text, hour int, minute int, numberOfRequests int, PRIMARY KEY ((cluster, date), datacenter, hour, minute))
The data will be stored per partition in the following way:
{datacenter: US_WEST_COAST {hour: 0 {minute: 0 {numberOfRequests: 130}} {minute: 1 {numberOfRequests: 125}} … {minute: 59 {numberOfRequests: 97}}} {hour: 1 {minute: 0 …
You can see that in order to retrieve data in an efficient way without a secondary index, you need to know all the clustering key columns for you selection. So, if you execute:
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour = 14 AND minute = 00;
Cassandra will find the data efficiently but if you execute:
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND hour = 14 AND minute = 0;
Cassandra will reject the query as it has to scan the entire partition to find the requested data, which is inefficient.
IN restrictions
Prior to 2.2 the IN restrictions on clustering columns are only allowed on the last clustering column. In 2.2, the IN restriction can be used on any column and the following query will work:
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour IN (14, 15) AND minute = 0;
The same set of data can be retrieved prior to 2.2 by using a multi-column IN restriction
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND (hour, minute) IN ((14, 0), (15, 0));
In 2.2, multi-column IN restrictions can be applied to any set of clustering columns.
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND (datacentre, hour) IN (('US_WEST_COAST', 14), (‘US_EAST_COAST’, 17)) AND minute = 0;
Prior to 2.2, multi-column IN restrictions can only be applied to the last set of clustering columns being restricted. By consequence, the previous query is invalid in 2.1. But following query is a perfectly valid one.
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND (hour) IN ((14), (15));
>, >=, <= and < restrictions
Single column slice restrictions are allowed only on the last clustering column being restricted. Therefore, the following queries are valid:
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour= 12 AND minute >= 0 AND minute <= 30;
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour >= 12;
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter > 'US';
But the following one is invalid:
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour >= 12 AND minute = 0;
Multi-column slice restrictions are allowed on the last set of clustering columns being restricted.
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WESTCOAST' AND (hour, minute) >= (12, 0) AND (hour, minute) <= (14, 0)
If both sides are specified for the slice, the restrictions must start with the same column. Therefore the query:
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND (hour, minute) >= (12, 30) AND (hour) < (14)
is valid but the following one is not:
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacentre = 'US_WEST_COAST' AND (hour, minute) >= (12, 0) AND (minute) <= (45)
CONTAINS and CONTAINS KEY restrictions
CONTAINS and CONTAINS KEY restrictions can only be used on collections when the query is using a secondary index.
Regular column restrictions
Regular columns can be restricted by =, >, >=, <= and <, CONTAINS or CONTAINS KEY restrictions if the query is a secondary index query. IN restrictions are not supported.
Secondary indices queries
Direct queries on secondary indices support only =, CONTAINS or CONTAINS KEY restrictions. The CONTAINS restriction can only be used on collection types. The CONTAINS KEY restriction can only be used on map for which the keys have been indexed. For example, if you have the following table:
CREATE TABLE contacts ( id int PRIMARY KEY, firstName text, lastName text, phones map
the following queries will be valid:
SELECT * FROM contacts WHERE firstname = 'Benjamin'; SELECT * FROM contacts WHERE phones CONTAINS KEY 'office'; SELECT * FROM contacts WHERE emails CONTAINS 'Benjamin@oops.com';
Secondary indices filtering
Secondary index queries allow you to restrict the returned results using the =, >, >=, <= and <, CONTAINS and CONTAINS KEY restrictions on non-indexed columns using filtering. Therefore, the following queries are valid, as long as ALLOW FILTERING is specified:
SELECT * FROM contacts WHERE firstname = 'Benjamin' AND lastname = 'Lerer' ALLOW FILTERING;
SELECT * FROM contacts WHERE phones CONTAINS KEY 'office' AND phones CONTAINS '0000.0000.0000' ALLOW FILTERING;
You should nevertheless be carefull with filtering. It can be an expensive operation.
Partition key restrictions and Secondary indices
When Cassandra must perform a secondary index query, it will contact all the nodes to check the part of the secondary index located on each node. If all the partition key components are restricted, Cassandra will use that information to query only the nodes that contains the specified partition keys, which will make the query more efficient. For secondary index queries, only = restrictions are supported on partition key columns.
Clustering column restrictions and Secondary indices
For each indexed value, Cassandra stores the full primary key (partition key columns + clustering columns) of each row containing the value. When an index query is performed, Casssandra will retrieve the primary keys of the rows containing the value from the index. It will then retrieve the rows from the table and perform any filtering needed on it. If the first clustering columns have been restricted, Cassandra will perform an early filtering on the primary keys returned by the index, making the filtering more efficient. For that type of filtering, Cassandra will accept the following clustering column restrictions: =, IN, >, >=, <= and <. So, if we add the following secondary index to the numberOfRequests table:
CREATE INDEX ON numberOfRequests (minute);
the following query is perfectly valid:
SELECT * FROM numberOfRequests WHERE cluster = 'cluster1' AND date = '2015-06-05' AND datacenter IN ('US_WEST_COAST', 'US_EAST_COAST') AND minute = 0 ALLOW FILTERING;
WHERE clause restrictions for the UPDATE and DELETE statements
In UPDATE and DELETE statements all the primary key columns must be restricted and the only allowed restrictions are:
- the single-column = on any partition key or clustering columns
- the single-column IN restriction on the last partition key column
CASSANDRA-6237 will address part of those limitations in 3.0 by adding support to UPDATE AND DELETE statements for:
- IN restrictions on any partition key column
- IN restictions on any clustering column
- EQ and IN multi-column restrictions on the clustering keys (mixed or not with single-column restrictions).
and by adding to DELETE statements support for range deletion of entire rows. IN restrictions will still not supported for conditional updates or deletes. Secondary index lookup for UPDATE and DELETE statements is not and will not be supported due to the risk involved with read before write.
Conclusion
To get the most out of Cassandra you will need to design your tables for the queries you want to perfom. Hopefully, the explanations provided in this post will help you in this task.
A last word about single and multi-column restrictions mix
In some versions of Cassandra mixing single and multi-column restrictions has been broken. It has been properly fixed in 2.0.15 and 2.1.5.