DataStax Bulk Loader Pt. 5 — Counting
The first three blog posts in this series dealt with data loading (here, here, and here), and the fourth blog post (here) dealt with data unloading. This blog will deal with the count mode for dsbulk.
Something new in dsbulk version 1.1.0 is the ability to count the data in a table. This is a common task that folks do after they load data, to verify that the data was loaded correctly. It could be that a load failed midway and they would like to see how far it got. It could be that the primary key was not what was desired, and instead of unique inserts some rows overwrote one another.
Example 23: Simple Counting
Let’s start with a simple count of the data:
$ dsbulk count -k dsbulkblog -t iris_with_id
Or
$ dsbulk count -k dsbulkblog -t iris_with_id --stats.modes global
Or
$ dsbulk count -k dsbulkblog -t iris_with_id -stats global
These all produce the same output:
Operation directory: /tmp/logs/COUNT_20190314-171517-238903.
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms
150 | 0 | 400 | 0.00 | 0.00 | 18.68 | 18.74 | 18.74
Operation COUNT_20190314-171517-238903 completed successfully in 0 seconds.
150
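As a sanity check, the same number can be obtained with a plain cqlsh count. On a tiny table like this it's fine, though on large tables SELECT COUNT(*) can time out, which is exactly why dsbulk's count mode is handy:
$ cqlsh -e "SELECT COUNT(*) FROM dsbulkblog.iris_with_id"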
Example 24: Counting without other information printed
We can remove the extraneous reporting information by reducing the verbosity via:
$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0
Which produces just:
150
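Because verbosity 0 prints only the number, the result is easy to use in a script. A minimal sketch, assuming dsbulk is on your PATH and the count is the only thing written to stdout at this verbosity:
$ row_count=$(dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0)
$ echo "iris_with_id has ${row_count} rows"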
Example 25: Counting by host
There are a few different ways to group the counts. The first is to count the rows per host:
$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.modes hosts
I did this example on my local machine, so there is only one host. The output for me was:
/127.0.0.1:9042 150 100.00
The first column is the host, the second is the count, and the third is the percentage of the total.
Example 26: Counting by range
Sometimes it’s important to understand how the data is distributed across the token ranges in the system. To do this, we use the ranges mode:
$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.modes ranges
Again, on a single-node machine the output is a little simple:
-9223372036854775808 -9223372036854775808 150 100.00
Here we have the beginning token of the range, the ending token of the range (again, a little strange as we have only one range), the count, and the percentage.
Example 27: Counting the largest partitions
The partitions mode will count the largest partitions in a table, in terms of the number of rows per partition. To do this, we need to create a table with a clustering column. Let’s do that with the Iris data set using:
$ cqlsh -e "CREATE TABLE dsbulkblog.iris_clustered(id INT, petal_length DOUBLE, petal_width DOUBLE, sepal_length DOUBLE, sepal_width DOUBLE, species TEXT, PRIMARY KEY ((species), id))"
We can load the iris.csv data into it with:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_clustered
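Before examining partitions, a quick global count should confirm that all 150 rows made it in:
$ dsbulk count -k dsbulkblog -t iris_clustered --log.verbosity 0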
Now we can count the largest partitions with:
$ dsbulk count -k dsbulkblog -t iris_clustered --log.verbosity 0 --stats.modes partitions
Which produces this output:
'Iris-virginica' 50 33.33
'Iris-versicolor' 50 33.33
'Iris-setosa' 50 33.33
The first column is the partition key, the second is the count, and the third is the percentage of the total.
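By default dsbulk reports the ten largest partitions; the stats.numPartitions setting adjusts how many are shown. For example, to limit the report to the top two partitions:
$ dsbulk count -k dsbulkblog -t iris_clustered --log.verbosity 0 --stats.modes partitions --stats.numPartitions 2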
Example 28: Counting by ranges and hosts
The --stats.modes parameter takes a list of modes, for example:
$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.modes ranges,hosts
Which produces this output:
Total rows per host:
/127.0.0.1:9042 150 100.00
Total rows per token range:
-9223372036854775808 -9223372036854775808 150 100.00
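The list isn't limited to two modes; a single run should be able to report all of them at once:
$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.modes global,hosts,ranges,partitions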
Example 29: Counting with a predicate
Sometimes you want to filter and only count some of the records. To do this, we can use --schema.query (or -query) to specify the predicates, supplying the full SELECT statement as if we were going to unload the data; dsbulk will instead count the results.
For example, if we only wanted to count the rows where petal_width = 2, we could use:
$ dsbulk count -query "SELECT id FROM dsbulkblog.iris_with_id WHERE petal_width = 2 ALLOW FILTERING"
Which would produce the following output:
Operation directory: /tmp/logs/COUNT_20190314-171916-543786.
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms
6 | 0 | 18 | 0.00 | 0.00 | 130.81 | 131.07 | 131.07
Operation COUNT_20190314-171916-543786 completed successfully in 0 seconds.
6
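Again, cqlsh can double-check the filtered count on a table this small:
$ cqlsh -e "SELECT COUNT(*) FROM dsbulkblog.iris_with_id WHERE petal_width = 2 ALLOW FILTERING"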
We can add parallelism to this query by including token-range placeholders; dsbulk will issue one query per token range, binding :start and :end to the boundaries of each range:
$ dsbulk count -query "SELECT id FROM dsbulkblog.iris_with_id WHERE petal_width = 2 AND Token(id) > :start AND Token(id) <= :end ALLOW FILTERING"
To download the DataStax Bulk Loader, click here.
For an intro to unloading, read the previous Bulk Loader blog here.
For DataStax Bulk Loader Part 6 on Examples for Loading From Other Locations, go here.