TechnologyJune 7, 2019

DataStax Bulk Loader Pt. 4 — Unloading

DataStax Bulk Loader Pt. 4 — Unloading

In the previous 3 blog posts (here, here, and here), we covered some loading examples, and covered some of the common options, such as logging and connection details.  In this blog post, we will turn our attention to unloading.


Example 18: Simple Unloading

Let’s start with a simple unloading example. We will unload the data into the /tmp/unload directory.

$ dsbulk unload -url /tmp/unload -k dsbulkblog -t iris_with_id

This will print a number of things to the screen as stderr, such as:

Operation directory: /tmp/logs/UNLOAD_20190314-170354-717718.

total | failed | rows/s | mb/s | kb/row | p50 ms |  p99ms | p999ms  

150 |      0 | 232 | 0.01 |   0.05 | 171.44 | 171.97 | 171.97

Operation UNLOAD_20190314-170354-717718 completed successfully in 0 seconds.

We could redirect stderr to /dev/null to avoid seeing this:

$ dsbulk unload -url /tmp/unload -k dsbulkblog -t iris_with_id 2> /dev/null

You’ll notice that the /tmp/unload directory has two files in it named output-000001.csv and output-000001.csv. The reason why this has the 000001 suffix is that dsbulk will unload in parallel, using multiple threads, if the data size is large enough. Since our data is small it will use just two threads. The supplied -url is a directory and the files are placed inside that directory.

Note that if you specify a directory with data in it already, you will get an error like this:

Operation UNLOAD_20190314-170542-812259 failed: connector.csv.url target directory: /tmp/unload must be empty.

Example 18.1: Specifying the root of the output files

We can actually specify the output root to be other than output. We do this with the --connector.csv.fileNameFormat, whose default is output-%0,6d.csv:

$ dsbulk unload -url /tmp/unload -k dsbulkblog -t iris_with_id --connector.csv.fileNameFormat "iris-%0,6d.csv"

Now we see that the /tmp/unload directory contains two files named iris-000001.csv and iris-000002.csv.

Example 18.2: Unloading to stdout

The default location for unloading is actually stdout:

$ dsbulk unload -k dsbulkblog -t iris_with_id

Which will have output such as:

id,petal_length,petal_width,sepal_length,sepal_width,species
5,1.7,0.4,5.4,3.9,Iris-setosa

1,1.4,0.2,4.9,3.0,Iris-setosa
...

Example 18.3: Piping the output to other programs

We talked about using command-line tools to modify the input and then pipe into dsbulk. The same can be done with the output. For example, removing the Iris- prefix from the species name (we still know that these are all irises):

$ dsbulk unload -k dsbulkblog -t iris_with_id 2> /dev/null | sed 's/Iris-//g' > /tmp/unload/iris_shortname.csv

The first few lines of that file are:

$ $ head -5 /tmp/unload/iris_shortname.csv
id,petal_length,petal_width,sepal_length,sepal_width,species

144,5.7,2.5,6.7,3.3,virginica
23,1.7,0.5,5.1,3.3,setosa
114,5.1,2.4,5.8,2.8,virginica
53,4.0,1.3,5.5,2.3,versicolor

Example 19: Unloading some columns

It is certainly the case where we do not wish to output all of the columns from the table. There are multiple ways to do this.

Example 19.1: Specifying via mapping

We can specify only a subset of the columns via a simple mapping, via the -m option:

$ dsbulk unload -k dsbulkblog -t iris_with_id -m "id,species"

Which results in the following:

id,species
23,Iris-setosa

114,Iris-virginica
53,Iris-versicolor
110,Iris-virginica
...

Example 19.2: Specifying via a custom query

We can specify this using a custom query. For example, to only

$ dsbulk unload -query "SELECT id, species FROM dsbulkblog.iris_with_id"

Which will result in output like:

id,species
23,Iris-setosa
114,Iris-virginica

53,Iris-versicolor
110,Iris-virginica
...

Example 19.3: Specifying via a custom query with multiple threads

The previous example is a little less efficient since it needs to be run in a single thread. We can get a multi-threaded version using the special :start and :end fields:

$ dsbulk unload -query "SELECT id, species FROM dsbulkblog.iris_with_id WHERE Token(id) > :start AND Token(id) <= :end"

This will do one query per token range, substituting in the starting and ending tokens for each range. Note that to use the Token() function you must supply all partition key columns in the order they were defined.

Example 19.4: Specifying extra columns in a custom query

It is possible to return additional columns, such as the writetime for a column, using the custom query option:

$ dsbulk unload -query "SELECT id, species, writetime(species) AS writetime FROM dsbulkblog.iris_with_id"

Which will result in data like:

id,species,writetime
128,Iris-virginica,1552583016883989

23,Iris-setosa,1552583016820016
114,Iris-virginica,1552583016866729
53,Iris-versicolor,1552583016828860
110,Iris-virginica,1552583016867375
...

Example 19.5: Specifying a predicate in a custom query

If a predicate is valid CQL, then we could provide a predicate in a custom query, as well:

$ dsbulk unload -query "SELECT id, species FROM dsbulkblog.iris_with_id WHERE id IN (101,102,103,104,105)"

Which produces the following output:

id,species
101,Iris-virginica

102,Iris-virginica
103,Iris-virginica
104,Iris-virginica
105,Iris-virginica

If you wanted to inspect a particular token range, you could specify that, as well:

$ dsbulk unload -query "SELECT id, species FROM dsbulkblog.iris_with_id WHERE Token(id) > 0 AND Token(id) < 100000000000000000"

Which would provide the following output:

id,species
115,Iris-virginica

Example 19.6: Specifying a DSE Search predicate

We can supply a DSE Search predicate via the solr_query mechanism (note: this requires enabling DSE Search). To do this, let’s create a new table and enable a search index on it:

$ cqlsh -e "CREATE TABLE dsbulkblog.iris_with_search (id int PRIMARY KEY, petal_length double, petal_width double, sepal_length double, sepal_width double, species text);"

$ cqlsh -e "CREATE SEARCH INDEX IF NOT EXISTS ON dsbulkblog.iris_with_search"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_search

Now, we can issue a query for just the Iris-setosa rows with:

$ dsbulk unload -query "SELECT id, petal_length, petal_width, sepal_length, sepal_width, species FROM dsbulkblog.iris_with_search WHERE solr_query = '{\\\"q\\\": \\\"species:Iris-setosa\\\"}'" --executor.continuousPaging.enabled false

Notice a few things here. First, we are disabling Continuous Paging, because DSE Search does not work with Continuous Paging. Second, we are not just escaping the double-quotes in the solr_query string, but we are “double-escaping” them, because there is an extra interpretation of the string and we need to do that. Lastly, notice that we are projecting out a specific set of columns. If we did

$ dsbulk unload -query "SELECT * FROM dsbulkblog.iris_with_search WHERE solr_query = '{\\\"q\\\": \\\"species:Iris-setosa\\\"}'" --executor.continuousPaging.enabled false

We would actually get a column named solr_query in the result:

Operation directory: /tmp/logs/UNLOAD_20190320-180708-312514

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms   

50 |      0 | 162 | 0.01 |   0.05 | 26.80 | 26.87 |  26.87

Operation UNLOAD_20190320-180708-312514 completed successfully in 0 seconds.
id,petal_length,petal_width,sepal_length,sepal_width,solr_query,species
0,1.4,0.2,5.1,3.5,,Iris-setosa

1,1.4,0.2,4.9,3.0,,Iris-setosa
2,1.3,0.2,4.7,3.2,,Iris-setosa
3,1.5,0.2,4.6,3.1,,Iris-setosa
4,1.4,0.2,5.0,3.6,,Iris-setosa
...

This is a little annoying, but easy enough to accommodate.

Example 20: Delimiter

As with loading, sometimes we want a different delimiter than the default, which is a comma. For example:

$ dsbulk unload -k dsbulkblog -t iris_with_id -delim "\t"

Which produces this tab-delimited output:

id petal_length petal_width sepal_length sepal_width species
23 1.7 0.5 5.1 3.3 Iris-setosa

114 5.1 2.4 5.8 2.8 Iris-virginica
53 4.0 1.3 5.5 2.3 Iris-versicolor
110 5.1 2.0 6.5 3.2 Iris-virginica
91 4.6 1.4 6.1 3.0 Iris-versicolor
...

Example 21: Nulls

Similar to loading, we may want to replace the database NULL values with a null string. For example, let’s say that the dsbulkblog.iris_with_id table was loaded with the iris_with_null_strings.csv data, as in Example 5.1, and now we want to unload that data but replace the NULL value with the string N/A:

$ dsbulk unload -k dsbulkblog -t iris_with_id -nullStrings "N/A"

Results in output like:

id,petal_length,petal_width,sepal_length,sepal_width,species
23,1.7,0.5,5.1,3.3,N/A

114,5.1,2.4,5.8,2.8,N/A
53,4.0,1.3,5.5,2.3,N/A
110,5.1,2.0,6.5,3.2,N/A
91,4.6,1.4,6.1,3.0,N/A
...

Example 22: Date Format

Like in loading, we can specify the format for the dates using the -codec.date parameter:

$ dsbulk unload -k dsbulkblog -t president_birthdates -delim "\t" --codec.date "EEEE MMMM d, y GGGG"

Which produces the following output:

president birthdate birthdate_string
Thomas Jefferson Saturday April 13, 1743 Anno Domini April 13, 1743

James Monroe Friday April 28, 1758 Anno Domini April 28th, 1758
George Washington Friday February 22, 1732 Anno Domini February 22, 1732
John Adams Sunday October 30, 1735 Anno Domini October 30 1735
John Quincy Adams Saturday July 11, 1767 Anno Domini July 11, 1767
James Madison Tuesday March 16, 1751 Anno Domini March 16, 1751


To download the DataStax Bulk Loader click here.

Read the next Bulk Loader blog about the count mode to dsbulk here.

To delve into some of the common options to load, unloading, and counting, read the previous Bulk Loader blog here.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

dsbulk DataStax Enterprise

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.