Technology | January 19, 2022

How DataStax Enterprise Analytics Simplifies Migrating to DataStax Astra DB

DataStax Enterprise (DSE), the hybrid cloud NoSQL database built on Apache Cassandra®, integrates real-time and batch operational analytics capabilities with an enhanced version of Apache Spark™. DSE Analytics enables you to easily generate ad-hoc reports, target customers with personalization, and process real-time streams of data (you can learn about the integrated benefits here).

By shifting to a true cloud-native open data stack built on DataStax Astra DB and Astra Streaming, enterprises can quickly unlock even more capabilities and harness the power of data to transform their businesses.

Enterprises that are already running DSE can easily achieve this migration with the help of DSE Analytics. This Spark tool, which we’ll introduce in this post, lets you connect directly to your Astra DB from DSE Analytics without having to provision a separate Spark cluster or worry about any other setup. The application was created to move data from DSE to Astra DB using DSE Analytics, but with simple changes it can be altered to reverse the flow or to compare data between the two data stores.
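
To make the mechanics concrete, here is a minimal sketch of the core transfer logic, simplified from the sample project. The object name and the exact wiring (especially the implicit connector scoping) are illustrative; the repo’s Migration.scala is the authoritative version:

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SparkSession

object MigrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DSEtoAstraMigration").getOrCreate()
    val sc = spark.sparkContext

    // Read the source table through the DSE cluster's built-in connection
    val rows = sc.cassandraTable("test_spark_migration", "data_table")

    // Clone the Spark conf and point the clone at Astra DB via the Secure
    // Connect Bundle (the full set of Astra options is covered below)
    val sparkConfAstra = sc.getConf.clone()
      .set("spark.cassandra.connection.config.cloud.path", "secure-connect-spark-migration.zip")

    // Bind the write side to the Astra connection and save the rows
    implicit val astraConnector: CassandraConnector = CassandraConnector(sparkConfAstra)
    rows.saveToCassandra("test_spark_migration", "data_table")
  }
}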

Prerequisites

Before you get started, there are a few small but important steps you need to take care of to prepare for your lift-off to Astra DB!

Dual writes

Setting up dual writes prior to your migration will enable you to perform a zero-downtime migration from DSE to Astra DB. It isn’t a necessity, though, and it’s beyond the scope of what I want to show you today. If you’re interested in exploring how to set up dual writes, you can learn more in this blog post.

Download example Spark code

These instructions reference sample code that can be found here. The sample code is set to use DSE 6.8.18; using a different version of DSE will require you to update the DSE version in the build.sbt file located in the root directory of the project.
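
For reference, the version typically lives in a dependency declaration like the following. This is a hypothetical build.sbt excerpt in the style of DataStax’s Spark build examples; check your copy of the project for the exact keys:

// Bump this value if you run a different DSE release
val dseVersion = "6.8.18"

resolvers += "DataStax Repository" at "https://repo.datastax.com/public-repos/"

// DSE supplies the Spark and connector classes at runtime, hence "provided"
libraryDependencies += "com.datastax.dse" % "dse-spark-dependencies" % dseVersion % "provided"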

Validate your DSE version

To connect to Astra DB with Spark via DSE Analytics, you will need to upgrade to or run a version of DSE that contains DSP-21510, which enables the connection to Astra DB. The DSE 6.8.18 release used by the sample code qualifies.

Create your Astra DB database

Ensure that your Astra DB database has been created and is ready to accept data. Instructions for how to create an Astra DB database can be found here. Once this is confirmed, you will need to create the appropriate table definitions in your Astra DB instance for your migration. This can be accomplished via the CQL console in the Astra DB UI or by using the REST or GraphQL APIs. For the purposes of these instructions we will use the following example schema, but you could leverage any schema for the migration using this procedure:

Keyspace: test_spark_migration

Table: data_table

CREATE TABLE test_spark_migration.data_table (
  id uuid, 
  scan_date timestamp, 
  payload text, 
  version_number int, 
  PRIMARY KEY (id, scan_date)
);
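
If your source table is empty and you want to exercise the migration end to end, a few rows can be inserted on the DSE side first. The values here are arbitrary, purely for illustration:

INSERT INTO test_spark_migration.data_table (id, scan_date, payload, version_number)
VALUES (uuid(), toTimestamp(now()), 'example payload', 1);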

Download the Secure Connect Bundle

You will need to download the Secure Connect Bundle from Astra DB to connect to your Astra DB instance via the Spark-Cassandra-Connector, which is included with your distribution of DSE. By following the red numbers in the diagram below, you will generate the Secure Connect bundle for download. Upload this to your DSE Analytics cluster and note the absolute path for later.

Figure 1. Procedure to download the Secure Connect Bundle (SCB) from Astra DB GUI.

Generate your application token

To generate the application token, follow the diagram below. We will leave the credentials and tokens unobscured for the purposes of this guide to reduce any possible confusion, but the database and tokens will be deleted when this guide is complete. Once you have downloaded your Application Token CSV, proceed to the next step.

Figure 2. Navigate to the Application Token creation page on the Astra DB GUI.

Figure 3. Procedure to create and download the Application Token from Astra DB GUI.

Configuring the Spark code

Once you have the necessary information, you can begin to alter the Scala code in Migration.scala. The entirety of our changes will happen in the following block:

// Path to the Secure Connect Bundle you uploaded to the cluster
sparkConfAstra.set("spark.cassandra.connection.config.cloud.path", "[SCB zip]")
// Client ID and Client Secret from your Application Token
sparkConfAstra.set("spark.cassandra.auth.username", "[username]")
sparkConfAstra.set("spark.cassandra.auth.password", "[password]")
// Continuous paging is DSE-specific; disable it when talking to Astra DB
sparkConfAstra.set("spark.dse.continuousPagingEnabled", "false")
// The Astra DB region to treat as the local datacenter
sparkConfAstra.set("spark.cassandra.connection.localDC", "[localdc]")

// Set values for the target keyspace and table
val keyspace = "test_spark_migration"
val table = "data_table"

Each of the sparkConfAstra options will be replaced with your own information: the Client ID from the Application Token becomes the username, the Client Secret becomes the password, and the local datacenter is the region your Astra DB database runs in. Using the information in the previous examples, my completed configuration would look like this:

sparkConfAstra.set("spark.cassandra.connection.config.cloud.path", "secure-connect-spark-migration.zip")
sparkConfAstra.set("spark.cassandra.auth.username", "NSTdMgHRhzjrZSOObphkmAcv")
sparkConfAstra.set("spark.cassandra.auth.password", "K.ynJ6YNk3Z6TKaKQPDhv4Q2j_1iOC8pSDv7q-bJLCspeOJ8nEzdnAhCyTUiQeZ28pr97sP8vM66qhii,aacg0GnTXfI2.KPAMfYPOyOl5mEulv-p.Evbqvf3rsnDPut")
sparkConfAstra.set("spark.dse.continuousPagingEnabled", "false")
sparkConfAstra.set("spark.cassandra.connection.localDC", "westus2")

// set value for keyspace and table
val keyspace = "test_spark_migration"
val table = "data_table"

The keyspace and table will not be changed in this guide, but make sure you change them if you plan to use a different keyspace/table combination.

Compiling the Spark jar

Once the changes to the Migration.scala and build.sbt files have been made, you are ready to compile your Spark jar.

To do so, run the following command from the root of the project: 

sbt clean assembly

After the build, Migration.jar will reside at the following path relative to the project root:

dse-astra-spark-migration/target/scala-2.11/Migration.jar
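
The assembly task comes from the sbt-assembly plugin, which the sample project should already wire in. If you are rebuilding the project from scratch, the plugin and jar name are typically declared like this (hypothetical snippets; plugin versions and key syntax may differ in your copy):

// project/plugins.sbt — registers the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt — names the fat jar that sbt clean assembly produces
assemblyJarName in assembly := "Migration.jar"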

Running the migration

Now that the jar is compiled, we can carry out the migration. To do so, execute the following command from your DSE cluster:

dse -u [username] -p [password] spark-submit --class com.astra.spark.Migration \
  --executor-cores [num cores] --num-executors [num executors] \
  --executor-memory [GB of memory]G --files [SCB Path] Migration.jar

  • [username] = Source DSE username
  • [password] = Source DSE password
  • [num cores] = Int value for number of cores
  • [num executors] = Int value for number of executors
  • [GB of memory] = Int value for GB of memory. Note the “G” needs to be supplied.
  • [SCB Path] = Absolute path to the Secure Connect Bundle provided by Astra DB.

Note that the [num cores], [num executors], and the [GB of memory] values will be determined based on the resources available for your DSE Analytics enabled cluster.
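
Putting it all together, a filled-in invocation might look like the following. The credentials, resource numbers, and bundle path here are placeholders sized for a small cluster; substitute your own values:

dse -u cassandra -p cassandra spark-submit --class com.astra.spark.Migration \
  --executor-cores 2 --num-executors 4 --executor-memory 4G \
  --files /path/to/secure-connect-spark-migration.zip Migration.jar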

What’s next?

Now that you have learned how to migrate your application data from DSE to Astra DB using the powerful, built-in DSE Analytics, you’re ready to build your open data stack and transform your customer experience and business!
