Bulk Loading with Brian Hess

Brian Hess joins the show to explain why the bulk loader is a vital tool for a distributed database, the history of bulk loaders for Apache Cassandra, and the virtues of the new DSBulk.

Highlights!

0:15 - Jeff welcomes Brian Hess to the show and discusses the scalability of sweater vest clusters

1:09 - Why bulk loading is a capability that people just assume exists for all databases

2:29 - Existing tools for bulk loading for Cassandra / DSE include: 1) the cqlsh COPY TO / FROM command - which doesn’t scale or handle errors well

3:31 - 2) Cassandra’s sstableloader can be used to load data but isn’t really a bulk loader.

4:21 - 3) People have also used Spark and the DSE Spark Connector to load data

4:53 - 4) Brian wrote his own “Cassandra loader” open source project using CQL

6:32 - Introducing DSBulk, a brand new bulk loader which builds on lessons learned from Cassandra loader

7:25 - Features include loading/unloading from JSON or CSV, number/data formats, and security

8:41 - Supported transformations include the now() function, but not case management. The tool operates via stdin/stdout so that you can chain results with tools like sed and awk

9:52 - Unloading features include column selection and filtering / limiting

11:24 - What makes DSBulk a superior tool: 1) high performance (4x faster than cqlsh COPY)

13:12 - 2) Error handling, including the ability to isolate errors and continue, and a dry run mode

16:08 - 3) Ease of use and configurability

16:42 - Challenging parts of building the loader were handling some offbeat use cases, getting the user experience right, and prioritizing features for this first release

20:02 - DSBulk is a distinct tool from the DSE Graph Loader - at least for now

22:13 - Brian’s shout outs to the DSBulk team