Technology · April 30, 2019

DataStax Bulk Loader Pt. 3 — Common Settings

This is the third blog post about dsbulk.  The first two blog posts (here and here) covered some basic loading examples.  This post will delve into some of the common options for loading, unloading, and counting.


Example 14: Specifying the connection

The examples so far have loaded into DSE running on the local machine, which is the default host.  We can specify the host IP or name with the -h parameter:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -h 127.0.0.1

Example 14.1: Specifying the port

In the unlikely case that DSE is listening on a port other than the default, 9042, you can specify the port using the -port parameter:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -h 127.0.0.1 -port 9042
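
If you run the same command against several environments, it can help to keep the host and port out of the command itself.  Here is a minimal sketch of that idea.  It assumes two environment variables, DSBULK_HOST and DSBULK_PORT, which are names made up for this example (dsbulk does not read them itself), and it falls back to the local machine and port 9042 when they are not set:

```shell
# Sketch: build the -h/-port arguments from environment variables.
# DSBULK_HOST and DSBULK_PORT are hypothetical names chosen for this example;
# dsbulk itself does not read them.
conn_args() {
  local host="${DSBULK_HOST:-127.0.0.1}"   # fall back to the local machine
  local port="${DSBULK_PORT:-9042}"        # fall back to the default port
  echo "-h $host -port $port"
}
```

You could then splice the result into the command line, for example:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id $(conn_args)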

Example 14.2: Specifying username/password

If your cluster is protected with username/password authentication, you can pass those in via the -u and -p parameters:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -u cassandra -p cassandra

NOTE: You should take care when submitting passwords on the command-line.  They can show up in shell history, process listings, system logs, etc.
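
One way to at least keep the password out of your shell history is to read it from a file that only your user can read.  The sketch below assumes such a file exists; the helper name and the file path are made up for this example.  Note that the password is still handed to dsbulk as an argument, so it may remain visible in the process list while the command runs:

```shell
# Sketch: read a password from the first line of a file instead of typing it.
# The helper name and the file it reads are hypothetical; create the file
# with restrictive permissions (for example, chmod 600).
read_password() {
  local pw=""
  # IFS= and -r keep spaces and backslashes intact;
  # "|| true" tolerates a file without a trailing newline.
  IFS= read -r pw < "$1" || true
  printf '%s' "$pw"
}
```

A possible invocation, assuming the password is stored in ~/.dsbulk_pw:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -u cassandra -p "$(read_password ~/.dsbulk_pw)"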

Example 14.3: Connecting to Kerberos enabled cluster

You can also use dsbulk with a Kerberos-enabled cluster:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --driver.auth.provider DseGSSAPIAuthProvider --driver.auth.saslService <service> --driver.auth.principal <principal> -h <host>

Where:

  • service - the service name, the same as the service value in ~/.cassandra/cqlshrc;
  • host - the hostname as registered in Kerberos, the same as the hostname value in ~/.cassandra/cqlshrc;
  • principal - the Kerberos principal to use, which can be obtained via klist.

See here for more details.

Example 15: Specifying the Consistency Level

The default consistency level for dsbulk is LOCAL_ONE.  You can specify the desired consistency level via the -cl parameter, which is short for --driver.query.consistency:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -cl LOCAL_QUORUM
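
If the consistency level comes from a script variable, a quick spelling check before a long-running load can catch typos early.  This is only a sketch: the function name is made up, and the list is the standard set of Cassandra consistency level names, not something taken from dsbulk itself.  Whether a given level makes sense for your operation is a separate question:

```shell
# Sketch: check that a value is one of the standard Cassandra
# consistency level names before handing it to -cl.
# The function name is hypothetical; it only checks the spelling.
is_valid_cl() {
  case "$1" in
    ANY|ONE|TWO|THREE|QUORUM|ALL|LOCAL_ONE|LOCAL_QUORUM|EACH_QUORUM|SERIAL|LOCAL_SERIAL)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

For example, with the level held in a variable named CL:

$ is_valid_cl "$CL" && dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -cl "$CL"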

Example 16: Errors and Logging

Let’s face it, data is never as clean as you think it is.  Things happen, errors happen, it’s okay. As we’ve discussed, dsbulk can handle errors by logging them and copying the problematic records to a “bad file” to be looked at later.  dsbulk also has other parameters that control how errors are handled.

Example 16.1: Max Errors

Sometimes the data is so bad that you want to tell dsbulk to stop.  You can specify the maximum number of errors to tolerate with the -maxErrors parameter, which is short for --log.maxErrors, and defaults to 100:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --log.maxErrors 3
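
When dsbulk runs from a script, you will probably want to know whether it stopped because of too many errors.  The sketch below maps the exit status to a short message.  It assumes the exit codes as I understand them from the dsbulk documentation (0 for success, 1 for completed with errors, 2 for aborted after too many errors), so check them against the version you run; the function name is made up for this example:

```shell
# Sketch: turn a dsbulk exit status into a short message.
# Assumes the documented exit codes (0 = success, 1 = completed with errors,
# 2 = aborted after too many errors); verify against your dsbulk version.
describe_status() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "completed with errors; check the bad file" ;;
    2) echo "aborted: too many errors" ;;
    *) echo "failed with status $1" ;;
  esac
}
```

For example:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --log.maxErrors 3; describe_status $?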

Example 16.2: Setting the logging directory

By default, dsbulk writes its logs to a directory named logs under the directory from which it is invoked.  You can specify a different location with the -logDir parameter, which is short for --log.directory:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -logDir /tmp/logs
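
dsbulk writes each run's files into its own subdirectory of the logging directory, so after a few runs it is handy to jump straight to the most recent one.  A minimal sketch, assuming nothing else creates subdirectories there and that directory names contain no newlines; the function name is made up for this example:

```shell
# Sketch: print the most recently modified subdirectory of a log directory.
# Prints nothing if the directory has no subdirectories.
latest_run_dir() {
  # ls -t sorts newest first; -d lists the directories themselves
  # rather than their contents.
  ls -td "$1"/*/ 2>/dev/null | head -n 1
}
```

For example, to list the files from the latest run:

$ ls "$(latest_run_dir /tmp/logs)"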

Example 16.3: Setting the logging verbosity

dsbulk has a way to dial up or down the verbosity of the logging, via --log.verbosity. Valid levels are 0, 1, and 2, from least to most verbose.  For example, to get the minimum output:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --log.verbosity 0

Example 17: Monitoring

dsbulk offers a few ways to monitor its progress.  The two main ways are reporting to the screen and via JMX.

Example 17.1: Setting the report rate

The progress is reported to the screen by default every 5 seconds.  This rate can be set with the -reportRate parameter, which is a shortcut for --monitoring.reportRate:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -reportRate 2s

Or:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -reportRate "2 seconds"

Example 17.2: Enabling/Disabling JMX

By default, dsbulk will provide metrics via JMX.  You can disable this via the --monitoring.jmx parameter:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --monitoring.jmx false


To download the DataStax Bulk Loader click here.

For an intro to unloading, read the next Bulk Loader blog here.

For basic loading examples, read the previous Bulk Loader blog here.
