Introduction to Composite Columns
Editor's note
Manually dealing with composite columns is no longer necessary when modeling with the modern Cassandra Query Language.
For dealing with composite columns in the obsolete Thrift API, read on.
Composite columns
Data modeling in Apache Cassandra is probably one of the most difficult concepts for new users to grasp, particularly those with a lot of experience in traditional RDBMS systems. The elusive sweet spot of “sorted, wide rows” can be difficult to find with some models, particularly those where the column family currently relies on super columns or is “static” (similar in design to a table in an RDBMS modeling, say, the attributes of a user per row). Composite columns, the subject of this entry, help adapt some of these models and also provide new indexing functionality for workloads, such as time series data, that are already known to perform well.
Sorted, wide rows are useful because they take excellent advantage of comparator ordering to provide efficient access into data by minimizing disk seeks. As data volume increases, they further cut down on the overhead associated with large numbers of skinny rows, which make optimizations such as key indexes and bloom filters less effective. Composites can help adapt some models to take full advantage of these efficiencies by facilitating ordering of nested components.
This entry will go through some practical applications of the composite comparator type in an attempt to demystify their usage and present the usefulness of their application to your data model.
At a high level, a composite comparator can be thought of simply as a comparator composed of several other comparators. Composite comparators provide the following major benefits for data modeling:
- custom inverted search indexes: when you want more control over the CF layout than a secondary index provides
- a replacement for super columns: a means to offset some of the worst performance penalties associated with them, as well as to extend the model to provide an arbitrary level of nesting
- grouping otherwise static skinny rows into wider rows for greater efficiency
The current composite comparator implementations come in two forms: CompositeType and DynamicCompositeType. This entry will discuss the former.
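To make the idea of "a comparator composed of several other comparators" more concrete, here is a minimal plain-Java sketch that orders column names component by component. It does not use Cassandra or Hector: the class and method names are hypothetical, and the real CompositeType compares serialized bytes rather than Java objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: not part of Cassandra or Hector.
class CompositeComparatorSketch {

    // Builds a comparator that applies one comparator per component position.
    static Comparator<List<String>> composite(List<Comparator<String>> parts) {
        return (a, b) -> {
            int n = Math.min(Math.min(a.size(), b.size()), parts.size());
            for (int i = 0; i < n; i++) {
                int c = parts.get(i).compare(a.get(i), b.get(i));
                if (c != 0) {
                    return c; // the first differing component decides the order
                }
            }
            // All shared components are equal: the shorter name sorts first.
            return Integer.compare(a.size(), b.size());
        };
    }

    // Returns a sorted copy, using natural String order for all three
    // components as a stand-in for (UTF8Type, UTF8Type, UTF8Type).
    static List<List<String>> sorted(List<List<String>> names) {
        Comparator<String> text = Comparator.naturalOrder();
        List<List<String>> copy = new ArrayList<>(names);
        copy.sort(composite(Arrays.asList(text, text, text)));
        return copy;
    }

    public static void main(String[] args) {
        List<List<String>> names = Arrays.asList(
            Arrays.asList("US", "TX", "Austin"),
            Arrays.asList("US", "CA", "San Jose"),
            Arrays.asList("US", "CA", "Fresno"));
        System.out.println(sorted(names));
    }
}
```

Because the first differing component decides the order, all names sharing a leading component end up adjacent to one another, which is what makes range slices over a prefix possible.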
If you want to understand some of the history behind composite comparators, you can look back at how and why they were added to Apache Cassandra:
https://issues.apache.org/jira/browse/CASSANDRA-2231
Though long, this issue thread shows some good discussions on why certain choices were made. It is worth a read if you ever want to explore composites at a code level. I also recommend the following presentation as background on indexing techniques in general with Apache Cassandra: http://www.slideshare.net/edanuff/indexing-in-cassandra (Note: Ed Anuff was the original contributor of the CompositeType comparator).
To see this functionality in action, we are going to experiment with some publicly accessible timezone data as our test set. In this case, we are storing the timezone for major cities in the United States. The format of this data is pretty simple and in raw form contains the following: two letter country code, two letter state/province code, city name, and timezone.
Where previously we would potentially have relied on super columns or a static column family to model this data, in our composite-oriented model we will combine the first three fields into the composite column name and use the timezone as the column value. This has the benefit of collapsing each record into a single column. A column of data will look something like the following:
US:TX:Austin=America/Chicago
Note that for larger data sets, you would want to spread out the columns among rows in order to avoid hotspots on any one node.
When we talk about composites, we can refer to the individual members as components. So, in this model, we have three components for the composite column name: the two letter country code, the two letter state code, and the city name. The value for the column is the timezone in which the city is located.
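As a rough sketch of this mapping, the following plain-Java class turns one raw record into a column name and value. The comma-separated field order is an assumption based on the format described above, the class is hypothetical, and it is not the CompositeDataLoader from the tutorial project. The colon-joined string is only the display notation used in this entry, not how the composite is stored.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: models one wide row in memory.
class TimezoneRowSketch {

    // Column name (in the US:TX:Austin display notation) -> timezone value.
    private final TreeMap<String, String> columns = new TreeMap<>();

    // Assumes a line of the form: country,state,city,timezone
    void addCsvLine(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + line);
        }
        // The first three fields become the components of the column name.
        String name = f[0].trim() + ":" + f[1].trim() + ":" + f[2].trim();
        // The fourth field becomes the column value.
        columns.put(name, f[3].trim());
    }

    Map<String, String> columns() {
        return columns;
    }

    public static void main(String[] args) {
        TimezoneRowSketch row = new TimezoneRowSketch();
        row.addCsvLine("US,TX,Austin,America/Chicago");
        row.addCsvLine("US,CA,San Jose,America/Los_Angeles");
        row.columns().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```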
With this particular data model, we can explore some of the features of using composite comparators as an inverted search index to take full advantage of Apache Cassandra's storage format. We will use the Java client Hector for examples to see how to search broadly within a row, initially returning a few thousand results, then increasingly narrowing the search criteria to just a few records as we add clauses to the composite column range used in the slice query.
You can download, run and experiment with this code via the following project on GitHub: http://github.com/zznate/cassandra-tutorial
In particular, we will be looking at CompositeQuery and CompositeDataLoader (though users new to the Hector API or Apache Cassandra in general may find the rest of the project contents helpful as well).
So first, we'll need to create the keyspace and column family:
create keyspace Tutorial
with strategy_options = {replication_factor:1}
and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
use Tutorial;
create column family CountryStateCity
with comparator = 'CompositeType(UTF8Type,UTF8Type,UTF8Type)'
and key_validation_class = 'UTF8Type'
and default_validation_class = 'UTF8Type';
Note how, in the comparator declaration, we combine the types we are going to use according to the position of the respective component. The order is important and must be maintained once declared, or operations will fail with InvalidRequestException, much as they would if you used the wrong type on any other non-composite column.
Now, with the column family in place, we can insert some data using CompositeDataLoader* with the following invocation of Maven:
mvn -e exec:java -Dexec.mainClass="com.datastax.tutorial.composite.CompositeDataLoader"
This class reads a CSV file from the data directory and inserts a few thousand columns of data under a single “ALL” key.
Now execute the CompositeQuery** class. The first set of results we are looking for is all columns located in the United States (prefixed with “US”). Run the query as follows:
mvn -e exec:java -Dexec.mainClass="com.datastax.tutorial.composite.CompositeQuery"
In the CompositeQuery class, the finish Composite column uses a GREATER_THAN_EQUAL clause. If you are wondering why this is not an EQUAL clause, you are not alone; this is a common point of confusion for users new to composites. The reason has to do with how each component is encoded.
The encoding of a component is made up of three parts: the length of the value, the value itself, and an “end of component” byte. It is this last part, the e-o-c byte, that controls slicing operations. In our case, for the GREATER_THAN_EQUAL clause described above, the value is 1. When applied to the finish composite of the slice operation, and used in conjunction with EQUAL on the start composite, it means “give me all the columns whose first component is 'US'.” We'll explore the other cases as we continue through the example.
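The three-part layout can be sketched with a small hand-rolled encoder. This is an illustration only, not Hector's serializer: the class and method names are hypothetical, and it assumes the 2-byte big-endian length prefix used by CompositeType.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: hand-rolled encoder, not Hector's serializer.
class ComponentEncodingSketch {

    // Encodes one UTF-8 component as: 2-byte length, value bytes, 1 e-o-c byte.
    // The 2-byte length width is an assumption about the CompositeType layout.
    static byte[] encode(String value, byte endOfComponent) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > 0xFFFF) {
            throw new IllegalArgumentException("component too long");
        }
        ByteBuffer out = ByteBuffer.allocate(2 + bytes.length + 1);
        out.putShort((short) bytes.length); // ByteBuffer is big-endian by default
        out.put(bytes);
        out.put(endOfComponent);
        return out.array();
    }

    static String toHex(byte[] data) {
        StringBuilder hex = new StringBuilder();
        for (byte b : data) {
            hex.append(String.format("%02x ", b));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        // Finish bound for "all columns whose first component is US": e-o-c = 1.
        System.out.println(toHex(encode("US", (byte) 1))); // prints: 00 02 55 53 01
        // A stored component, or an EQUAL clause: e-o-c = 0.
        System.out.println(toHex(encode("US", (byte) 0))); // prints: 00 02 55 53 00
    }
}
```

The two encodings differ only in the final byte, which is why the choice of clause changes where the bound falls relative to the stored columns without changing the value being compared.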
So, with the current structure of the query, this example is not terribly interesting: it just returns all the columns prefixed with “US” which, in our subset of data, is the whole row.
Let's narrow the search range down to California (abbreviated as “CA”) in our second component. Like our first example, the start clause contains an EQUAL expression, the finish clause a GREATER_THAN_EQUAL. This gives us all the columns for the state of California. Note that we also change the first clause to EQUAL, since we are now comparing the second component. This needs to be done to set the e-o-c byte back to zero so that the composite comparator will move on to examining the next component. Not doing so will result in an InvalidRequestException.
Composite start = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);
Composite end = compositeFrom(startArg, Composite.ComponentEquality.GREATER_THAN_EQUAL);
start.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);
end.addComponent(1,"CA",Composite.ComponentEquality.GREATER_THAN_EQUAL);
Running CompositeQuery again will produce a result set limited to California. To further narrow down the search to cities beginning with the prefix “San ”, we add the following for the third component:
start.addComponent(2,"San ",Composite.ComponentEquality.EQUAL);
end.addComponent(2, "San " + Character.MAX_VALUE, Composite.ComponentEquality.GREATER_THAN_EQUAL);
This gives us a list of all columns starting with “San ” as the city name. Note that Character.MAX_VALUE is appended to take advantage of the comparator ordering.
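The Character.MAX_VALUE trick can be tried in isolation with a sorted set of strings. This is a minimal sketch under stated assumptions: the class name and the sample city list are hypothetical, and Java's String ordering stands in for UTF8Type.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only: city names in a sorted set stand in for
// the third component of the composite column name.
class PrefixRangeSketch {

    // Returns the names that start with the prefix, using only range bounds.
    static NavigableSet<String> withPrefix(NavigableSet<String> sorted, String prefix) {
        // Character.MAX_VALUE ('\uffff') sorts after any other char, so for
        // ordinary text prefix + '\uffff' is an upper bound for that prefix.
        return sorted.subSet(prefix, true, prefix + Character.MAX_VALUE, true);
    }

    public static void main(String[] args) {
        NavigableSet<String> cities = new TreeSet<>(Arrays.asList(
            "Sacramento", "San Diego", "San Jose", "Santa Cruz", "Stockton"));
        System.out.println(withPrefix(cities, "San ")); // prints: [San Diego, San Jose]
    }
}
```

Note that “Santa Cruz” is excluded because the prefix includes the trailing space, and 't' sorts after ' '.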
A similar query making use of the equality operations, say to select all the cities for Wyoming and West Virginia (“WY” and “WV” respectively), could be constructed as follows:
start.addComponent(1,"WV",Composite.ComponentEquality.EQUAL);
end.addComponent(1,"WY",Composite.ComponentEquality.GREATER_THAN_EQUAL);
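To see why the finish bound needs GREATER_THAN_EQUAL, the slices above can be modeled in memory. The sketch below is a simplification and not Cassandra's or Hector's implementation: the class, the method names, and the sample columns are hypothetical, stored names are assumed to carry an e-o-c of 0, and EQUAL and GREATER_THAN_EQUAL are represented by the values 0 and 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: an in-memory model of slicing composite
// column names.
class CompositeSliceSketch {

    // Compares a slice bound (its components plus the e-o-c of its last
    // component) with a stored column name, whose e-o-c bytes are all 0.
    static int compareBound(List<String> bound, int eoc, List<String> name) {
        for (int i = 0; i < bound.size(); i++) {
            if (i >= name.size()) {
                return 1; // name is a strict prefix of the bound, so it sorts first
            }
            int c = bound.get(i).compareTo(name.get(i));
            if (c != 0) {
                return c;
            }
        }
        if (eoc != 0) {
            return eoc; // 1: after every name with this prefix; -1: before
        }
        return name.size() > bound.size() ? -1 : 0;
    }

    // Keeps every name that lies between the start and finish bounds.
    static List<List<String>> slice(List<List<String>> names,
                                    List<String> start, int startEoc,
                                    List<String> finish, int finishEoc) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> name : names) {
            boolean afterStart = compareBound(start, startEoc, name) <= 0;
            boolean beforeFinish = compareBound(finish, finishEoc, name) >= 0;
            if (afterStart && beforeFinish) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> names = Arrays.asList(
            Arrays.asList("US", "CA", "Fresno"),
            Arrays.asList("US", "CA", "San Jose"),
            Arrays.asList("US", "TX", "Austin"),
            Arrays.asList("US", "WV", "Charleston"),
            Arrays.asList("US", "WY", "Cheyenne"));
        List<String> ca = Arrays.asList("US", "CA");
        // EQUAL (0) on start, GREATER_THAN_EQUAL (1) on finish: all of California.
        System.out.println(slice(names, ca, 0, ca, 1));
        // EQUAL on both bounds matches nothing here: every stored name has a
        // third component and therefore sorts after the finish bound.
        System.out.println(slice(names, ca, 0, ca, 0));
        // West Virginia through Wyoming.
        System.out.println(slice(names, Arrays.asList("US", "WV"), 0,
                                 Arrays.asList("US", "WY"), 1));
    }
}
```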
Null values are also allowed on insertion. For example, if you wanted a “state level” column with null for the city name, you could insert with only two components (or one!) of the composite populated. In doing so, you would want to check for null when retrieving the right-most components of the composite from a slice.
Hopefully that is enough of an overview to give you an idea of how powerful composite comparators can be for some use cases. The examples above are all MIT licensed, so make whatever use of them you can.
*Though it deals with a trivial amount of data in a simple format, CompositeDataLoader can be used as a model for application-level parallelized bulk loading with the Hector API. Feel free to experiment with this approach for your application's bulk loading needs.
** The CompositeQuery class makes use of an auto-paging feature built into Hector via the ColumnSliceIterator class. CompositeQuery uses this class in conjunction with an inner java.lang.Iterable implementation to provide clean iteration semantics back up to the caller. Use this as an example of how to retrieve a moderate to large number of columns from a row.
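The general shape of such auto-paging can be sketched in plain Java. This is an in-memory stand-in, not Hector's ColumnSliceIterator: the class name, the fetchPage helper, and the use of a NavigableMap in place of a real slice query are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableMap;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// Hypothetical illustration only: pages through the column names of one
// wide row, fetching a fixed number of columns at a time.
class PagingIterableSketch implements Iterable<String> {

    private final NavigableMap<String, String> row;
    private final int pageSize;

    PagingIterableSketch(NavigableMap<String, String> row, int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be >= 1");
        }
        this.row = row;
        this.pageSize = pageSize;
    }

    // Stands in for one slice query: at most pageSize names after 'after'.
    private List<String> fetchPage(String after) {
        NavigableMap<String, String> tail =
            (after == null) ? row : row.tailMap(after, false);
        List<String> page = new ArrayList<>();
        for (String name : tail.keySet()) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(name);
        }
        return page;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private List<String> page = fetchPage(null);
            private int pos = 0;

            @Override
            public boolean hasNext() {
                if (pos < page.size()) {
                    return true;
                }
                if (page.size() < pageSize) {
                    return false; // a short page means the row is exhausted
                }
                // Resume just after the last column name already returned.
                page = fetchPage(page.get(page.size() - 1));
                pos = 0;
                return !page.isEmpty();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return page.get(pos++);
            }
        };
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        row.put("US:CA:Fresno", "America/Los_Angeles");
        row.put("US:CA:San Jose", "America/Los_Angeles");
        row.put("US:TX:Austin", "America/Chicago");
        for (String name : new PagingIterableSketch(row, 2)) {
            System.out.println(name);
        }
    }
}
```

The caller sees a single for-each loop, while the iterator issues a new bounded fetch each time a full page has been consumed, so the whole row is never held in memory at once.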