TechnologyMarch 15, 2015

How to do Joins in Apache Cassandra® and DataStax Enterprise

How to do Joins in Apache Cassandra® and DataStax Enterprise

For years, a critique directed at NoSQL databases was that you couldn’t do join queries like those possible in an RDBMS. While this is true for some NoSQL databases, we thought it would be helpful to remind Apache Cassandra® users that join operations are indeed now possible with Cassandra.

There are a couple of ways that you can join tables together in Cassandra and query them:

  1. Use Apache Spark’s SparkSQL™ with Cassandra (either open source or in DataStax Enterprise - DSE).

  2. Use DataStax provided ODBC connectors with Cassandra and DSE.

In this post we’ll first illustrate how to perform SQL Joins [1] with Cassandra tables using SparkSQL and then look at how to use DataStax’s ODBC connector to easily create join queries[2] that can be used to create dashboards with BI software like Tableau [3].

Creating Join Queries Using Spark and Cassandra

While you can create your own Cassandra and Spark combination clusters using open source, its a lot easier to use DSE as it bundles and certifies Spark with Cassandra as part of its analytics package. To use Spark in DSE, you simply start one or more nodes in analytics mode, and you’re ready to roll.

DSE ships with a Weather Application Demo that shows how DSE Analytics works. We’ll use a couple of the objects in that demo to illustrate how to perform a simple join operation. For more details on how to setup the demo, and to view much more complicated join queries used in the application, please refer to our online documentation.

The tables used in this example have the following structures:

To create a join query, we first start one or more DSE nodes in analytic mode by executing:

We then indicate what keyspace to use, which in this case is the “weathercql” keyspace:

Creating a join operation with SparkSQL involves using the following syntax:

For this example, we’ll join data from the monthly and station table, store the results in a SparkSQL CassandraSQLContext - RDD (resilient distributed dataset) called “results”, iterate through and print the results:

Creating Join Queries with Cassandra and ODBC

You can create join queries on Cassandra data outside of Spark by using DataStax’s free ODBC driver (we also supply an ODBC driver for Spark). This means that any developer/DBA/BI/ETL tool that has ODBC connectivity can connect to and query data in Cassandra.

It is important to note that join operations done with the current ODBC driver should not involve large tables as the performance may not be acceptable for most queries that target big clusters.

Let’s take a look at how this works with one of the most popular BI tools in the market, which is Tableau. The steps below show a simple way to execute a freehand SQL join query using Tableau [3] and DataStax Enterprise 4.6.

1. First, we create an ODBC connection/datasource ( Fig 1) to DataStax Enterprise.

Fig 1: DataStax Cassandra ODBC Connector

2. Next, we open Tableau [3] and connect to Cassandra (Fig 2) using the ODBC connection created in the previous step.

Fig 2: Connecting to Cassandra using ODBC

3. Then, we code our join query using Tableau’s Custom SQL Query[4] editor (Fig 3) to create a dashboard that displays the join query’s results.(Fig 4).

Fig 3: Custom SQL against DataStax Enterprise

Fig 4: Tableau Dashboard

And Coming Soon...Joins on Steroids with Graph!

We announced our acquisition of Auerlius back in Feburary of this year, which is the company behind the Titan open source graph database. If you understand what a graph database can do, then you know that one thing it does very well is handle the traversal of multiple relationships between vertices (entities in an RDBMS world) without any need to create indexes or materialized views to overcome join performance inefficiencies in an RDBMS.

In short, a graph database represents the ultimate in joins where ease of use and performance are concerned. Coming soon in DSE will be DSE Graph, which will provide just this type of capability along with multi-model support in DSE.

As an example of how graph can dramatically reduce the complexity of join operations, the below comparison shows a sample, RDBMS join query on the left for a recommendation engine application vs. how the exact same query is handled in a graph database (Fig 5). Big difference, wouldn't you say?

Fig 5: RDBMS Join Query vs. Graph Database Join Query

Conclusion

Joining Cassandra tables together with SQL-styled queries can be carried out in multiple ways today, with each method being easy to use and code. For more information on creating joins on Cassandra data, please refer to the online documentation here and here. You can find downloads of DSE and our ODBC drivers on our downloads page.

[1] Customers need to consider to the costs of creating such ad-hoc queries against distributed databases.

[2] Customers need to run thorough query performance assessments when using this option

[3] Customers can use any Business Intelligence or ETL solution that supports standard JDBC/ODBC.

[4] Custom SQL is for illustrative purposes only. You can use any methodology supported by your BI vendor to create Reports or Dashboards. ETL jobs can be created in a similar method using your favorite ETL tool.

   

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.