August 7, 2020

Open Source FTW: New Tools For Apache Cassandra®

Last year, we made our bulk loading utility and Apache Kafka® connector available for free use with Cassandra. These tools are now open source and Apache-licensed. You can find and browse the source for these tools at the GitHub locations below.

Feel free to check out the code, fork the repositories, or build the tools from source. You can also use GitHub issues to request new features or report bugs, and if you would like to contribute, pull requests are most welcome. For general questions, you can find us at community.datastax.com. See you there!

DSBulk at a glance

DSBulk 1.6.0, the latest version, brings a few nice improvements, bug fixes, and noticeably better throughput for most unload operations. See the release notes and documentation for details.

The libraries that power this tool are also now available from Maven Central under the group ID com.datastax.oss. This means that it is now possible to invoke loading, unloading, and counting operations programmatically in your Java applications!

Let's see a quick example. Assume that we want to programmatically invoke a "load" operation to ingest many CSV files into Cassandra. First, we need to include a few dependencies:

XML

<!-- The Runner; should always be included -->
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>dsbulk-runner</artifactId>
    <version>1.6.0</version>
</dependency>
<!-- The Load Workflow: the operation we want to run -->
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>dsbulk-workflow-load</artifactId>
    <version>1.6.0</version>
</dependency>
<!-- The Connector: CSV in our case -->
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>dsbulk-connectors-csv</artifactId>
    <version>1.6.0</version>
</dependency>
<!-- Two components required by the Load Workflow: an Executor and a Batcher -->
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>dsbulk-executor-reactor</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>dsbulk-batcher-reactor</artifactId>
    <version>1.6.0</version>
</dependency>

With the above dependencies included in your application, invoking DSBulk becomes a piece of cake:

Java

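import com.datastax.oss.dsbulk.runner.DataStaxBulkLoader;
import com.datastax.oss.dsbulk.runner.ExitStatus;

// Arguments mirror the command line: the operation name first, then its options
DataStaxBulkLoader dsbulk = new DataStaxBulkLoader(
  "load",
  "-h"  , "host1.com",         // contact point
  "-k"  , "my_keyspace",       // target keyspace
  "-t"  , "my_table",          // target table
  "-url", "/path/to/my/files"  // CSV sources
);
// Runs the operation and returns its final status
ExitStatus status = dsbulk.run();
System.out.println(status);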

And voilà! The above triggers the load operation and prints the final exit status. All the benefits of the bulk loader's reactive, highly efficient threading model are now available in your application with very little code. Log files and other artifacts generated by the tool during the operation are written to the usual locations on your filesystem. (Please note: programmatic use of the bulk loader is new and considered beta for now.)
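When the options come from application configuration rather than string literals, the argument list can be assembled before it is handed to the loader. The following is only a sketch: `DsbulkArgs` and `build` are hypothetical names that are not part of DSBulk, and it assumes the constructor accepts the same flat sequence of strings used in the example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of DSBulk): turns an operation name and a map
// of short options into the flat argument list used in the example above.
final class DsbulkArgs {

  private DsbulkArgs() {}

  static String[] build(String operation, Map<String, String> options) {
    List<String> args = new ArrayList<>();
    args.add(operation); // the operation to run, e.g. "load"
    for (Map.Entry<String, String> option : options.entrySet()) {
      args.add("-" + option.getKey()); // short option name, e.g. "-k"
      args.add(option.getValue());     // option value
    }
    return args.toArray(new String[0]);
  }

  public static void main(String[] unused) {
    // LinkedHashMap keeps the options in insertion order
    Map<String, String> options = new LinkedHashMap<>();
    options.put("h", "host1.com");
    options.put("k", "my_keyspace");
    options.put("t", "my_table");
    options.put("url", "/path/to/my/files");

    String[] args = build("load", options);
    System.out.println(String.join(" ", args));
    // Assumption: the array could then be passed on, e.g. new DataStaxBulkLoader(args)
  }
}
```

Keeping the assembly separate from the invocation makes the argument list easy to log or unit-test without a running cluster.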

However, if you plan to use the bulk loader as a standalone, ready-to-use tool, we recommend using DSBulk's distribution bundle from our download server instead of downloading individual jars from Maven Central. The distribution bundle contains everything required to run the tool from the command line.

Kafka Sink Connector at a glance

Like the bulk loader, the Kafka Sink Connector 1.4.0 ships with a few improvements. See the release notes for details, or browse the documentation.

In the process of making our Kafka Sink Connector open source, we had to change a few things. For end users, the most visible change is the connector class. If your configuration file has the entry below:

connector.class=com.datastax.kafkaconnector.DseSinkConnector

You should now change it to:

connector.class=com.datastax.oss.kafka.sink.CassandraSinkConnector

Don't worry, though: the old connector class still works, but the connector will log a deprecation warning. Apart from that, your existing configuration should work the same way.
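If you manage many connector configuration files, the rename can be scripted. Below is a minimal sketch using only `java.util.Properties`; the class name `ConnectorClassMigration`, the `migrate` method, and the sample `name=my-sink` entry are hypothetical and not part of the connector.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper (not part of the connector): rewrites the deprecated
// connector class name to the new one in a loaded configuration.
final class ConnectorClassMigration {

  static final String OLD_CLASS = "com.datastax.kafkaconnector.DseSinkConnector";
  static final String NEW_CLASS = "com.datastax.oss.kafka.sink.CassandraSinkConnector";

  private ConnectorClassMigration() {}

  // Returns true if the configuration was changed.
  static boolean migrate(Properties config) {
    if (OLD_CLASS.equals(config.getProperty("connector.class"))) {
      config.setProperty("connector.class", NEW_CLASS);
      return true;
    }
    return false; // already migrated, or a different connector entirely
  }

  public static void main(String[] args) throws IOException {
    // Sample configuration; "name=my-sink" is a made-up entry for illustration
    Properties config = new Properties();
    config.load(new StringReader("name=my-sink\nconnector.class=" + OLD_CLASS + "\n"));

    boolean changed = migrate(config);
    System.out.println("changed=" + changed
        + ", connector.class=" + config.getProperty("connector.class"));
  }
}
```

All other entries are left untouched, in line with the note above that the rest of an existing configuration should keep working.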

One noteworthy improvement is the ability to specify which errors to ignore while processing a Kafka record. The ignoreErrors option now has three possible values:

NONE: No errors are ignored; every error makes the connector stop. Use this option if you are sure that your Kafka records are fully compliant with the database table schema and that your database is always available.
DRIVER: Only driver errors are ignored and logged (that is, Kafka records that could not be written to the database); other errors (for example, mapping errors) make the connector stop. Use this option if you know that your Kafka records should be compliant with the schema, but don't want the connector to stop if the database is unreachable or is having trouble servicing requests.
ALL: All errors are ignored and logged; the connector never stops, whether because of an invalid Kafka record or because the database is unreachable.
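The three values can be read as a small decision table. The sketch below is an illustrative model of that table, not the connector's actual code: the `IgnoreErrorsModel` class, the `ErrorKind` enum, and the `stopsConnector` method are hypothetical names introduced here.

```java
// Illustrative model only; the connector's real implementation may differ.
final class IgnoreErrorsModel {

  // The three values of the ignoreErrors option described above
  enum IgnoreErrors { NONE, DRIVER, ALL }

  // Hypothetical classification: driver errors (the record could not be
  // written to the database) versus anything else, such as mapping errors
  enum ErrorKind { DRIVER, OTHER }

  private IgnoreErrorsModel() {}

  // Returns true if an error of the given kind makes the connector stop.
  static boolean stopsConnector(IgnoreErrors policy, ErrorKind kind) {
    switch (policy) {
      case ALL:
        return false;                    // everything is ignored and logged
      case DRIVER:
        return kind != ErrorKind.DRIVER; // only driver errors are ignored
      default:
        return true;                     // NONE: every error stops the connector
    }
  }

  public static void main(String[] args) {
    // Print the full decision table
    for (IgnoreErrors policy : IgnoreErrors.values()) {
      for (ErrorKind kind : ErrorKind.values()) {
        System.out.println(policy + " + " + kind + " error -> stops="
            + stopsConnector(policy, kind));
      }
    }
  }
}
```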

You can download the Kafka Sink Connector for Cassandra from our download server or from Confluent Hub.

Enjoy!
