DataStax Bulk Loader Pt. 3 — Common Settings
This is the third blog post about dsbulk. The first two posts (here and here) covered some basic loading examples. This post covers some of the common options for loading, unloading, and counting.
Example 14: Specifying the connection
The examples so far have loaded into DSE running on the local machine, which is the default host. We can specify the host IP or name with the -h parameter:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -h 127.0.0.1
Example 14.1: Specifying the port
In the unlikely event that DSE is listening on a port other than the default, 9042, you can specify the port with the -port parameter (the example below simply passes the default explicitly):
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -h 127.0.0.1 -port 9042
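If you run the same load against several environments, it can help to keep the host and port out of the command itself. The following is a minimal sketch, not part of dsbulk: the function name load_iris and the environment variables DSBULK_BIN, DSBULK_HOST, and DSBULK_PORT are hypothetical names chosen for this example. Only the -h and -port flags come from dsbulk.

```shell
# Hypothetical wrapper: host and port come from environment variables,
# falling back to the defaults used in this post (127.0.0.1 and 9042).
# DSBULK_BIN, DSBULK_HOST and DSBULK_PORT are assumptions of this sketch,
# not settings that dsbulk itself reads.
load_iris() {
  "${DSBULK_BIN:-dsbulk}" load -url "$1" -k dsbulkblog -t iris_with_id \
    -h "${DSBULK_HOST:-127.0.0.1}" -port "${DSBULK_PORT:-9042}"
}

# Dry run: substitute echo for the dsbulk binary to print the arguments
# instead of starting a load.
DSBULK_BIN=echo load_iris /tmp/dsbulkblog/iris.csv
```

Unset DSBULK_BIN (or point it at your dsbulk executable) to run the load for real.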
Example 14.2: Specifying username/password
If your cluster is protected with username/password authentication, you can pass those in via the -u and -p parameters:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -u cassandra -p cassandra
NOTE: You should take care when submitting passwords on the command line. These can show up in shell history, process listings, system logs, environments, etc.
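One way to keep the literal password out of your shell history is to prompt for it. The sketch below is an illustration, not a dsbulk feature: read_password is a hypothetical helper name. Note that the password is still passed to dsbulk as an argument, so it can remain visible in the process list while the load runs.

```shell
# Hypothetical helper: prompt for a password without echoing it.
# The prompt goes to stderr so that only the password is written to stdout.
read_password() {
  printf 'Password: ' >&2
  stty -echo 2>/dev/null   # turn off terminal echo; does nothing useful if stdin is not a terminal
  IFS= read -r password
  stty echo 2>/dev/null    # restore terminal echo
  printf '\n' >&2
  printf '%s' "$password"
}

# Usage against a real cluster (commented out so the sketch has no side effects):
# password=$(read_password)
# dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -u cassandra -p "$password"
```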
Example 14.3: Connecting to a Kerberos-enabled cluster
You can also use dsbulk with a Kerberos-enabled cluster:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --driver.auth.provider DseGSSAPIAuthProvider --driver.auth.saslService <service> --driver.auth.principal <principal> -h <host>
Where:
- service - the service name, the same as the service value in ~/.cassandra/cqlshrc;
- host - the hostname as registered in Kerberos, the same as the hostname value in ~/.cassandra/cqlshrc;
- principal - the Kerberos principal to use, which can be obtained via klist.
See here for more details.
Example 15: Specifying the Consistency Level
The default consistency level for dsbulk is LOCAL_ONE. You can specify the desired consistency level via the -cl parameter, which is short for --driver.query.consistency:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -cl LOCAL_QUORUM
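To get a feel for what LOCAL_QUORUM asks of the cluster: a quorum is a majority of the replicas, that is floor(RF / 2) + 1, where RF is the replication factor of the keyspace in the local datacenter. The small sketch below just does that arithmetic. The function name quorum is made up for this example.

```shell
# Replicas that must acknowledge each write at (LOCAL_)QUORUM: floor(RF / 2) + 1.
# Shell integer division already rounds down.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

for rf in 1 3 5; do
  echo "RF=$rf -> quorum of $(quorum "$rf")"
done
# RF=1 -> quorum of 1
# RF=3 -> quorum of 2
# RF=5 -> quorum of 3
```

So with a replication factor of 3, a load at LOCAL_QUORUM can tolerate one replica being down, while LOCAL_ONE only needs a single replica to respond.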
Example 16: Errors and Logging
Let’s face it, data is never as clean as you think it is. Things happen, errors happen, it’s okay. As we’ve discussed, dsbulk can handle errors by logging them and copying the problematic records to a “bad file” to be looked at later. But dsbulk also has other parameters that control how errors are handled.
Example 16.1: Max Errors
Sometimes the data is so bad that you want to tell dsbulk to stop. You can specify the maximum number of errors to tolerate with the -maxErrors parameter, which is short for --log.maxErrors and defaults to 100:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --log.maxErrors 3
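In a script, you will usually want to know whether the load stopped because it hit the error limit. The sketch below assumes the exit statuses described in the dsbulk documentation (0 for success, 1 for completed with some rejected records, 2 for aborted because of too many errors). Please confirm these against the documentation for your version. The function name describe_dsbulk_status is hypothetical.

```shell
# Hypothetical helper: turn a dsbulk exit status into a readable message.
# The meanings of 0, 1 and 2 are assumed from the dsbulk documentation;
# every other status is reported generically.
describe_dsbulk_status() {
  case "$1" in
    0) echo "load completed with no errors" ;;
    1) echo "load completed, but some records were rejected" ;;
    2) echo "load aborted: too many errors" ;;
    *) echo "load failed with exit status $1" ;;
  esac
}

# Usage against a real cluster (commented out so the sketch has no side effects):
# dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --log.maxErrors 3
# describe_dsbulk_status $?

describe_dsbulk_status 2
```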
Example 16.2: Setting the logging directory
By default, dsbulk logs to a directory named logs in the directory where dsbulk is invoked. You can specify a different location with the -logDir parameter, which is short for --log.directory:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -logDir /tmp/logs
Example 16.3: Setting the logging verbosity
dsbulk has a way to dial up or down the verbosity of the logging, via --log.verbosity. Valid levels are 0, 1, and 2, from least to most verbose. For example, to get the minimum output:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --log.verbosity 0
Example 17: Monitoring
dsbulk offers a few ways to monitor the progress of an operation. The two main ways are reporting to the screen and JMX.
Example 17.1: Setting the report rate
The progress is reported to the screen by default every 5 seconds. This rate can be set with the -reportRate parameter, which is a shortcut for --monitoring.reportRate:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -reportRate 2s
Or:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -reportRate "2 seconds"
Example 17.2: Enabling/Disabling JMX
By default, dsbulk will provide metrics via JMX. You can disable this via the --monitoring.jmx parameter:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --monitoring.jmx false
To download the DataStax Bulk Loader click here.
For an intro to unloading, read the next Bulk Loader blog here.
For basic loading examples, read the previous Bulk Loader blog here.