Cassandra User Defined Functions using the Python Driver
Introduction
Apache Cassandra 2.2 introduced two new useful features: User Defined Functions (UDFs) and User Defined Aggregates (UDAs). These new features allow certain types of computation to occur server-side, directly on the Cassandra cluster. This can reduce network traffic to the client and reduce client-side resource utilization. The combination of these features can improve performance in certain scenarios. As a part of a series of two blog posts, this first post will briefly demonstrate some of these scenarios where UDFs can be used, how these would have been implemented pre-Cassandra 2.2, and finally how these can be simplified using UDFs via the DataStax Python Driver. My second post will go over using UDAs with the Python Driver.
All examples in this post will be using the following simple keyspace, shown here created via the Python driver:
1 2 3 4 5 |
|
User Defined Functions (UDFs)
UDFs are functions that are run directly on Cassandra as part of query execution. The scripting portion of the UDF can be performed by any language that supports the Java Scripting API, such as Java, Javascript, Python, Ruby, and many other languages (JARs need to be dropped into the classpath to support Python/Ruby). Similar to User Defined Types (UDTs), UDFs are associated with a specific keyspace, and if no keyspace is explicitly specified, the current keyspace is used. As UDFs are a part of the Cassandra schema, they are automatically propagated to all nodes in the cluster. To use UDFs, they must be explicitly enabled in cassandra.yaml:
1 |
|
Here's the syntax for creating a UDF:
1 2 3 4 5 6 |
|
UDFs are identified by their signature: the combination of its keyspace, function name and arguments. As such, it's possible to overload UDFs by having the same function name as long as the argument size or argument types are different. The arguments can be literals or terms, including collection types such as lists, maps and UDTs. Cassandra uses a bundled version of the DataStax Java Driver to perform collection and UDT conversions internally. As UDFs are simple scripts, it is possible for them to fail during runtime due to conditions such as illegal arguments or null pointer exceptions, thus we must take care to handle these cases. If a UDF does fail this way, the entire query is aborted.
All examples in this section will be using the following schema:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
Inserted via the Python Driver:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
And the resulting data in the inventory table:
1 2 3 4 5 6 7 |
|
Example #1: Finding the length of a column
Let's say we're interested in all items that can be sold in at least 10 states. In pre-Cassandra 2.2, we would have to retrieve all the rows from the database, and the entire available_states
column to do some computation on the client side and find the number of states per item:
1 2 3 4 5 6 7 8 9 |
|
This is costly not only because we have to do computations on the client-side, but we're also reading in the entire available_states
column when we simply wanted its size. This can be simplified using a UDF:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
Here, we're creating a UDF called len_states
, that takes in a list of values and returns its length. This greatly reduces the amount of network traffic on the wire, as now we're simply returning the size (which is what we care about), rather than the entire list of states. On the client-side, now we simply iterate through the list of items to check its size. In my next post, we will see how we can use UDAs to simplify even this computation and directly retrieve the list of items that match our conditions. Notice here that the resulting column name of the SELECT
query after calling the UDF is simplex_len_states_available_states
, which is in the form <keyspace>_<udf_name>_<column_name_queried_by_udf>
.
Example #2: Retrieving an element from inside a UDT
Similarly, let's say we're only interested in the zipcode
of the item_location
(for calculating postal charges, for example) from inside of the Address
UDT. In pre-Cassandra 2.2, we would so something like:
1 2 3 4 5 6 7 8 9 |
|
Here once again we're retrieving the entire UDT, where we only need one piece of it-- namely the zipcode
. In Cassandra 2.2, we can use a UDF to simplify and speed up the process:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
Notice that we can directly use the UDT by name as an argument to our UDF, and all of the Java driver's UDT manipulation mechanisms are available to us.
Example #3: Performing a server-side calculation of a column
As a final example, consider perhaps a use case where we need to calculate the volume of an item (perhaps once again for postal charges). We're interested in both the volume, as well as the proper units for the calculated volume. In pre-Cassandra 2.2, we would retrieve all the dimensions
and perform the volume calculation on the client-side:
1 2 3 4 5 6 7 8 9 |
|
In Cassandra 2.2, we can move the volume computation to the server via a UDF. Since UDFs work horizontally across a single column in a single row, this is readily implemented:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
Another benefit here is that with the volume computation stored on the server, any client reading data from this Cassandra cluster can retrieve the same calculated volume information, cutting down on redunant code across multiple applications.
Conclusion
By supporting JSR 223 scripting languages, UDFs allows you to write server-side executable scripts in your favorite language among many supported languages. While UDFs are useful by itself, UDFs in conjunction with UDAs bring out the full functionality of Cassandra 2.2. In my next post, I will discuss taking advantage of User Defined Aggregates using the Python driver.