User Defined Functions in Cassandra 3.0

Release 3.0 of Apache Cassandra will bring a new cool feature called User Defined Functions (UDF). Yes – users can write code that is executed inside Cassandra daemons.

UDFs are implemented by stateless code. By stateless I mean that a UDF implementation has just its input arguments to rely on. There is nothing like a shared state or execution context.

The programming language is your own choice. Java source support is built-in and even script languages that have JSR-223 support (javax.script) can be used.

So – the following statement creates a simple UDF which calculates some math:

CREATE FUNCTION my_sin ( input double )

RETURNS double LANGUAGE java

BODY

return input == null

? null

: Double.valueOf( Math.sin( input.doubleValue() ) );

END BODY;

It can be used in a query like this:

1 2	`SELECT key, my_sin(value) FROM my_table WHERE key IN (1,` `2,` `3);`

UDFs are not restricted to a single argument. Pass in as many arguments you need.

CREATE FUNCTION my_adder ( val1 double, val2 double )

RETURNS double LANGUAGE java

BODY

return (val1 == null || val2 == null)

? null

: Double.valueOf( val1.doubleValue() + val2.doubleValue() );

END BODY;

SELECT key, my_adder(net_price, taxes) AS price FROM my_table WHERE key = 42;

The final 3.0 release will support all all CQL types including user types, tuple types and collections as argument and return types. Currently, user+tuple types and collections are not supported.

What are UDFs good for?

Basically you can move some simple calculations to the Cassandra cluster. But that alone is not a huge win. UDFs are going to use used for functional indexes or aggregation to name a few. So stay tuned

Namespaces

The function names used in the previous examples share the same context as built-in CQL functions like token() or now(). This may cause collisions in the future. For example, you created a UDF called foobar and we decide to add a CQL function which is also called foobar. In this case, the CQL function will be preferred. A warning is logged at Cassandra startup. Your only option is to drop that UDF – there is no way to execute it. Bad, of course.

To avoid such name collisions, use UDF namespaces. You can use your company’s name, your project name or whatever is unique for you. Referring to the example at the beginning, let’s assume the namespace super_math:

CREATE FUNCTION super_math::sin ( input double ) 

RETURNS double LANGUAGE ...; -- continue with your implementation

SELECT key, super_math::sin(value) FROM my_table WHERE key IN (1, 2, 3);

CREATE FUNCTION super_math::adder ( val1 double, val2 double ) 

RETURNS double LANGUAGE ...; -- continue with your implementation

SELECT key, super_math::adder(net_price, taxes) AS price FROM my_table WHERE key = 42;

Note that namespaces are just logical constructs to group UDFs.

UDF Overloading

UDFs allows to use the same function name with different signatures. If you want to code an adder UDF that takes 2, 3 and 4 arguments, create three UDFs:

CREATE FUNCTION super_math::adder ( val1 double, val2 double ) 

LANGUAGE ... -- continue with your implementation

CREATE FUNCTION super_math::adder ( val1 double, val2 double, val3 double ) 

LANGUAGE ... -- continue with your implementation

CREATE FUNCTION super_math::adder ( val1 double, val2 double, val3 double, val4 double )

LANGUAGE ... -- continue with your implementation

So each of the following SELECTs works:

SELECT key, super_math::adder(net_price, taxes) AS fullprice FROM my_table WHERE key = 42;

SELECT key, super_math::adder(net_price, taxes, postage) AS fullprice FROM my_table WHERE key = 42;

SELECT key, super_math::adder(net_price, taxes, postage, extras) AS fullprice FROM my_table WHERE key = 42;

Languages

All above examples assume, that the compiled byte code is in the class path of Cassandra since they directly refer to class and method names.

But you can also define UDFs using Java source and JSR-223 script languages.

Input and output

Input parameters and the return value always use the boxed Java types (Integer notint, Double not double, etc) to be able to indicate null value.

CQL and Java type comparison

CQL Java

=== ====

boolean java.lang.Boolean

int java.lang.Integer

bigint java.lang.Long

float java.lang.Float

double java.lang.Double

inet java.net.InetAddress

text java.lang.String

ascii java.lang.String

timestamp java.util.Date

uuid java.util.UUID

timeuuid java.util.UUID

varint java.math.BigInteger

decimal java.math.BigDecimal

blob java.nio.ByteBuffer

list<E> java.util.List<E> where E is also a type from this list

set<E> java.util.Set<E> where E is also a type from this list

map<K,V> java.util.Map<K,V> where K and V is also a types from this list

(user type) com.datastax.driver.core.UDTValue

(tuple type) com.datastax.driver.core.TupleValue

The last five types will be implemented as part of CASSANDRA-7563.

Java source

Java source example:

CREATE OR REPLACE FUNCTION math::sin ( input double ) RETURNS double LANGUAGE java

BODY

if (input == null) {

return null;

}

double v = Math.sin( input.doubleValue() );

return Double.valueOf(v);

END BODY;

But Java source has a restriction – you cannot use boxing and unboxing. We did not implement our own Java compiler – we just use javassist. And javassist does not support boxing and unboxing.

The code is distributed to all Cassandra nodes – you do not need to install Jar files manually.

JSR-223 Script languages

You want to implement your UDFs in Javascript or Groovy?

Javascript is supported out-of-the box. To use Groovy or any other JSR-223 language you have to install the language and it’s JSR-223 provider on each Cassandra node.

Example that implements the willingly used sin function in Javascript:

CREATE OR REPLACE FUNCTION jsmath::sin(val double) 

RETURNS double LANGUAGE javascript 

BODY

  Math.sin(val);

  END BODY;

Java bytecode (classes)

Distributing UDF code as classes (jar files) is possible but not recommended. You have to keep all Cassandra nodes in sync with the correct version of the UDF implementation. If only one node does not have the correct implementation, you’ll get into trouble (with your users).

Your UDF method implementations must be public static methods. By default the method is looked up in the specified class using the UDF name (without the namespace). To use another method, append a # and the name of the method to the UDF parameter. The following example looks up the method doubleAdderFor2 in classmy.package.MyAdder:

CREATE FUNCTION super_math::adder ( val1 double, val2 double )

  RETURNS double 'my.package.MyAdder#doubleAdderFor2';

Schema Replication

UDFs are part of the system schema and transparently migrated to all other nodes in your cluster.

The new table in system keyspace is schema_functions.

CQL Syntax

CREATE [OR REPLACE]

[[NON] DETERMINISTIC]

FUNCTION [<namespace> :: ] <name> 

[IF NOT EXISTS]

"("

( <argumentName> <argumentType> [, <argumentName> <argumentType>]* )?

")" RETURNS <returnType>

(

‘<javaClassAndMethod>‘ |

( LANGUAGE <language> BODY <body> END BODY ) 

)

namespace = UDF namespace
name = UDF name
OR REPLACE = replace the function if it already exists
[NON] DETERMINISTIC = mark a UDF as deterministic or non-deterministic – required for future use in functional indexes. Default is deterministic.
IF NOT EXISTS = only try to create function if it does not already exist
argumentName = name of the argument – used by Java and scripting languages
argumentType = argument type type, only CQL types are supported
returnType = return type type, only CQL types are supported
javaClassAndMethod = fully qualified classname – optionally postfixed with ‘#’ and method name
language = either java or any installed JSR-223 language name like javascript – the language name is not quoted
body = UDF implementation code

1 2	`DROP FUNCTION [<namespace> :: ] <name>  [IF EXISTS]`

namespace = UDF namespace
name = UDF name
IF EXISTS = only try to drop function if it does not already exist

Note: DROP FUNCTION drops all overloads with the same name.

Test your UDF implementations!

Bad UDF implementations may be a great possibility to slow down your cluster. So please test your UDF implementations carefully using unit tests that test your UDFs using any possible input parameter combination. For example also try null, Double.NaN or Infinity as parameters to see what happens.

Security

Without authentication and authorization enabled, everyone can inject code into your Cassandra cluster. You can either trust all your users or enable authentication and authorization and grant permissions only to trusted users. Currently only global CREATEand DROP privileges apply for the CREATE FUNCTION and DROP FUNCTIONstatements. But this is subject to change and going to be improved (CASSANDRA-7557).