TechnologyAugust 12, 2014

User Defined Functions in Cassandra 3.0

User Defined Functions in Cassandra 3.0

User Defined Functions in Cassandra 3.0

Release 3.0 of Apache Cassandra will bring a new cool feature called User Defined Functions (UDF). Yes – users can write code that is executed inside Cassandra daemons.

UDFs are implemented by stateless code. By stateless I mean that a UDF implementation has just its input arguments to rely on. There is nothing like a shared state or execution context.

The programming language is your own choice. Java source support is built-in and even script languages that have JSR-223 support (javax.script) can be used.

So – the following statement creates a simple UDF which calculates some math:

1

2

3

4

5

6

7

CREATE FUNCTION my_sin ( input double )

    RETURNS double LANGUAGE java

    BODY

        return input == null

            ? null

            : Double.valueOf( Math.sin( input.doubleValue() ) );

    END BODY;

It can be used in a query like this:

1

2

SELECT key, my_sin(value) FROM my_table WHERE key IN (1, 2, 3);

UDFs are not restricted to a single argument. Pass in as many arguments you need.

1

2

3

4

5

6

7

8

9

CREATE FUNCTION my_adder ( val1 double, val2 double )

    RETURNS double LANGUAGE java

    BODY

        return (val1 == null || val2 == null)

            ? null

            : Double.valueOf( val1.doubleValue() + val2.doubleValue() );

    END BODY;

SELECT key, my_adder(net_price, taxes) AS price FROM my_table WHERE key = 42;

The final 3.0 release will support all all CQL types including user types, tuple types and collections as argument and return types. Currently, user+tuple types and collections are not supported.

What are UDFs good for?

Basically you can move some simple calculations to the Cassandra cluster. But that alone is not a huge win. UDFs are going to use used for functional indexes or aggregation to name a few. So stay tuned 

Namespaces

The function names used in the previous examples share the same context as built-in CQL functions like token() or now(). This may cause collisions in the future. For example, you created a UDF called foobar and we decide to add a CQL function which is also called foobar. In this case, the CQL function will be preferred. A warning is logged at Cassandra startup. Your only option is to drop that UDF – there is no way to execute it. Bad, of course.

To avoid such name collisions, use UDF namespaces. You can use your company’s name, your project name or whatever is unique for you. Referring to the example at the beginning, let’s assume the namespace super_math:

1

2

3

4

5

6

7

8

CREATE FUNCTION super_math::sin ( input double )


    RETURNS double LANGUAGE ...; -- continue with your implementation

SELECT key, super_math::sin(value) FROM my_table WHERE key IN (1, 2, 3);

  

CREATE FUNCTION super_math::adder ( val1 double, val2 double )


    RETURNS double LANGUAGE ...; -- continue with your implementation

SELECT key, super_math::adder(net_price, taxes) AS price FROM my_table WHERE key = 42;

Note that namespaces are just logical constructs to group UDFs.

UDF Overloading

UDFs allows to use the same function name with different signatures. If you want to code an adder UDF that takes 2, 3 and 4 arguments, create three UDFs:

1

2

3

4

5

6

7

CREATE FUNCTION super_math::adder ( val1 double, val2 double )


    LANGUAGE ...    -- continue with your implementation

CREATE FUNCTION super_math::adder ( val1 double, val2 double, val3 double )


    LANGUAGE ...    -- continue with your implementation

CREATE FUNCTION super_math::adder ( val1 double, val2 double, val3 double, val4 double  )

    LANGUAGE ...    -- continue with your implementation

So each of the following SELECTs works:

1

2

3

4

SELECT key, super_math::adder(net_price, taxes) AS fullprice FROM my_table WHERE key = 42;

SELECT key, super_math::adder(net_price, taxes, postage) AS fullprice FROM my_table WHERE key = 42;

SELECT key, super_math::adder(net_price, taxes, postage, extras) AS fullprice FROM my_table WHERE key = 42;

 Languages

All above examples assume, that the compiled byte code is in the class path of Cassandra since they directly refer to class and method names.

But you can also define UDFs using Java source and JSR-223 script languages.

Input and output

Input parameters and the return value always use the boxed Java types (Integer notintDouble not double, etc) to be able to indicate null value.

CQL and Java type comparison

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

CQL              Java

===              ====

boolean          java.lang.Boolean

int              java.lang.Integer

bigint           java.lang.Long

float            java.lang.Float

double           java.lang.Double

inet             java.net.InetAddress

text             java.lang.String

ascii            java.lang.String

timestamp        java.util.Date

uuid             java.util.UUID

timeuuid         java.util.UUID

varint           java.math.BigInteger

decimal          java.math.BigDecimal

blob             java.nio.ByteBuffer

list<E>          java.util.List<E>      where E is also a type from this list

set<E>           java.util.Set<E>       where E is also a type from this list

map<K,V>         java.util.Map<K,V>     where K and V is also a types from this list

(user type)      com.datastax.driver.core.UDTValue

(tuple type)     com.datastax.driver.core.TupleValue

The last five types will be implemented as part of CASSANDRA-7563.

Java source

Java source example:

1

2

3

4

5

6

7

8

9

CREATE OR REPLACE FUNCTION math::sin ( input double ) RETURNS double LANGUAGE java

BODY

 if (input == null) {

    return null;

  }

  double v = Math.sin( input.doubleValue() );

  return Double.valueOf(v);

END BODY;

But Java source has a restriction – you cannot use boxing and unboxing. We did not implement our own Java compiler – we just use javassist. And javassist does not support boxing and unboxing.

The code is distributed to all Cassandra nodes – you do not need to install Jar files manually.

JSR-223 Script languages

You want to implement your UDFs in Javascript or Groovy?

Javascript is supported out-of-the box. To use Groovy or any other JSR-223 language you have to install the language and it’s JSR-223 provider on each Cassandra node.

Example that implements the willingly used sin function in Javascript:

1

2

3

4

5

CREATE OR REPLACE FUNCTION jsmath::sin(val double)


    RETURNS double LANGUAGE javascript


    BODY


        Math.sin(val);


    END BODY;

Java bytecode (classes)

Distributing UDF code as classes (jar files) is possible but not recommended. You have to keep all Cassandra nodes in sync with the correct version of the UDF implementation. If only one node does not have the correct implementation, you’ll get into trouble (with your users).

Your UDF method implementations must be public static methods. By default the method is looked up in the specified class using the UDF name (without the namespace). To use another method, append a # and the name of the method to the UDF parameter. The following example looks up the method doubleAdderFor2 in classmy.package.MyAdder:

1

2

3

CREATE FUNCTION super_math::adder ( val1 double, val2 double )


    RETURNS double 'my.package.MyAdder#doubleAdderFor2';

 Schema Replication

UDFs are part of the system schema and transparently migrated to all other nodes in your cluster.

The new table in system keyspace is schema_functions.

CQL Syntax

1

2

3

4

5

6

7

8

9

10

11

12

CREATE [OR REPLACE]

    [[NON] DETERMINISTIC]

    FUNCTION [<namespace> :: ] <name>


    [IF NOT EXISTS]

    "("

        ( <argumentName> <argumentType> [, <argumentName> <argumentType>]* )?

    ")" RETURNS <returnType>

    (

       ‘<javaClassAndMethod>‘ |

        ( LANGUAGE <language> BODY <body> END BODY )


    )

namespace = UDF namespace
name = UDF name
OR REPLACE = replace the function if it already exists
[NON] DETERMINISTIC = mark a UDF as deterministic or non-deterministic – required for future use in functional indexes. Default is deterministic.
IF NOT EXISTS = only try to create function if it does not already exist
argumentName = name of the argument – used by Java and scripting languages
argumentType = argument type type, only CQL types are supported
returnType = return type type, only CQL types are supported
javaClassAndMethod = fully qualified classname – optionally postfixed with ‘#’ and method name
language = either java or any installed JSR-223 language name like javascript – the language name is not quoted
body = UDF implementation code

1

2

DROP FUNCTION [<namespace> :: ] <name>
    [IF EXISTS]

namespace = UDF namespace
name = UDF name
IF EXISTS = only try to drop function if it does not already exist

Note: DROP FUNCTION drops all overloads with the same name.

Test your UDF implementations!

Bad UDF implementations may be a great possibility to slow down your cluster. So please test your UDF implementations carefully using unit tests that test your UDFs using any possible input parameter combination. For example also try null, Double.NaN or Infinity as parameters to see what happens.

Security

Without authentication and authorization enabled, everyone can inject code into your Cassandra cluster. You can either trust all your users or enable authentication and authorization and grant permissions only to trusted users. Currently only global CREATEand DROP privileges apply for the CREATE FUNCTION and DROP FUNCTIONstatements. But this is subject to change and going to be improved (CASSANDRA-7557).

References

CASSANDRA-7395 Support for pure user-defined functions (UDF)
CASSANDRA-7562 Java source code for UDFs
CASSANDRA-7526 Defining UDFs using scripting language directly from CQL
CASSANDRA-4914 Custom aggregate and filtering functions in CQL
CASSANDRA-7458 functional indexes
CASSANDRA-7391 Partial indexes
CASSANDRA-7563 UserType, TupleType and collections in UDFs
CASSANDRA-7557 User permissions for UDFs

Author

User Defined Functions in Cassandra 3.0” was created by Robert Stupp, Cologne, Germany
snazy@snazy.de
@snazy

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.