TechnologyJuly 8, 2019

DataStax Bulk Loader Pt. 5 — Counting

DataStax Bulk Loader Pt. 5 — Counting

The first three blog posts in this series dealt with data loading (here, here, and here), and the fourth blog post (here) dealt with data unloading.  This blog will deal with the count mode for dsbulk.


Something new in dsbulk version 1.1.0 is the ability to count the data in a table.  This is a common task that folks do once the load data to ensure that the data was loaded correctly.  It could be that a load failed midway and they would like to see how far it got. It could be that the primary key was not what was desired, and instead of unique inserts it became overwrites.

Example 23: Simple Counting

Let’s start with a simple count of the data:

$ dsbulk count -k dsbulkblog -t iris_with_id

Or

$ dsbulk count -k dsbulkblog -t iris_with_id --stats.modes global

Or

$ dsbulk count -k dsbulkblog -t iris_with_id -stats global

These all produce the same output:

Operation directory: /tmp/logs/COUNT_20190314-171517-238903.

total | failed | rows/s | mb/s | kb/row | p50 ms | p99ms | p999ms   

   150 |      0 | 400 | 0.00 |   0.00 | 18.68 | 18.74 |  18.74

Operation COUNT_20190314-171517-238903 completed successfully in 0 seconds.

150

Example 24: Counting without other information printed

We can remove the extraneous reporting information by reducing the verbosity via:

$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0

Which produces just:

150

Example 25: Counting by host

There are a few different ways to group up the counts. The first is to count the rows per host:

$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.mode hosts

I did this example on my local machine, so there is only one host. The output for me was:

/127.0.0.1:9042 150 100.00

The first column is the host, the second is the count, and the third is the percentage of the total.

Example 26: Counting by range

Sometimes it’s important to understand the size by the token ranges in the system. To do this, we use the ranges mode:

$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.mode ranges

Again, in a single node machine the output is a little simple:

-9223372036854775808 -9223372036854775808 150 100.00

Here we have the beginning token of the range, the ending token of the range (again, a little strange as we have only one range), the count, and the percentage.

Example 27: Counting the largest partitions

The partitions option will count the largest partitions for a table, in terms of number of rows per partition. To do this, we need to create a table with a clustering column. Let’s do that with the Iris data set using:

$ cqlsh -e "CREATE TABLE dsbulkblog.iris_clustered(id INT, petal_length DOUBLE, petal_width DOUBLE, sepal_length DOUBLE, sepal_width DOUBLE, species TEXT, PRIMARY KEY ((species), id))"

We can load the iris.csv data into it with:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_clustered

Now we can count the largest partitions with:

$ dsbulk count -k dsbulkblog -t iris_clustered --log.verbosity 0 --stats.mode partitions

Which produces this output:

'Iris-virginica' 50 33.33

'Iris-versicolor' 50 33.33

'Iris-setosa' 50 33.33

The first column is the primary key for the partition, the second is the count, and the third is the percentage of the total.

Example 28: Counting by ranges and hosts

The  --stats.mode parameter takes a list of modes, for example:

$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.mode ranges,hosts

Which produces this output:

Total rows per host:

/127.0.0.1:9042 150 100.00

Total rows per token range:

-9223372036854775808 -9223372036854775808 150 100.00

Example 29: Counting with a predicate

Sometimes you want to filter and only count some of the records. To do this, we can use the --schema.query (or -q) to specify the predicates. To do this, we will supply the full SELECT statement as if we were going to unload the data, but dsbulk will instead count the results.

For example, if we only wanted to count the rows where petal_width = 2, we could use:

$ dsbulk count -query "SELECT id FROM dsbulkblog.iris_with_id WHERE petal_width = 2 ALLOW FILTERING"

Which would produce the following output:

Operation directory: /tmp/logs/COUNT_20190314-171916-543786.

total | failed | rows/s | mb/s | kb/row | p50 ms |  p99ms | p999ms     

   6 |      0 | 18 | 0.00 |   0.00 | 130.81 | 131.07 | 131.07

Operation COUNT_20190314-171916-543786 completed successfully in 0 seconds.

6

We can add parallelism to this query by doing:

$ dsbulk count -query "SELECT id FROM dsbulkblog.iris_with_id WHERE Petal_width = 2 AND Token(id) > :start AND Token(id) <= :end ALLOW FILTERING"


To download the DataStax Bulk Loader, click here.

For an intro to unloading, read the previous Bulk Loader blog here.

For DataStax Bulk Loader Part 6 on Examples for Loading From Other Locations, go here.

dsbulk DataStax Enterprise

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.