Bulk Loading with Brian Hess
Brian Hess joins the show to explain why the bulk loader is a vital tool for a distributed database, the history of bulk loaders for Apache Cassandra, and the virtues of the new DSBulk.
Highlights!
0:15 - Jeff welcomes Brian Hess to the show and discusses the scalability of sweater vest clusters
1:09 - Why bulk loading is a capability that people just assume exists for all databases
2:29 - Existing tools for bulk loading into Cassandra / DSE include: 1) the cqlsh COPY TO / FROM command, which doesn’t scale or handle errors well
3:31 - 2) Cassandra’s sstableloader can be used to load data but isn’t really a bulk loader.
4:21 - 3) People have also used Spark and the DSE Spark Connector to load data
4:53 - 4) Brian wrote his own “Cassandra loader” open source project using CQL
6:32 - Introducing DSBulk, a brand new bulk loader which builds on lessons learned from Cassandra loader
7:25 - Features include loading/unloading from JSON or CSV, number/data formats, and security
8:41 - Supported transformations include the now() function, not case management. The tool operates via stdin/stdout so that you can chain results with tools like sed and awk
9:52 - Unloading features include column selection and filtering / limiting
11:24 - What makes DSBulk a superior tool: 1) high performance (4x faster than cqlsh COPY)
13:12 - 2) Error handling, including the ability to isolate errors and continue, and a dry run mode
16:08 - 3) Ease of use and configurability
16:42 - Challenging parts of building the driver were handling some offbeat use cases, getting the user experience right, and prioritizing features for this first release
20:02 - DSBulk is a distinct tool from the DSE Graph Loader - at least for now
22:13 - Brian’s shout-outs to the DSBulk team