Technology | May 24, 2022

Cassandra Data Loading: 8 Tips for Loading Data into Astra DB

The most commonly asked Apache Cassandra® and Astra DB question is: What is the easiest way to load large amounts of data into Astra DB quickly? The answer is the DataStax Bulk Loader.

The DataStax Bulk Loader tool (dsbulk) is a command-line tool for loading and unloading data from Cassandra and Astra DB. dsbulk can load, unload, and count data for DataStax Astra cloud databases, DataStax Enterprise (DSE) 4.7 and later databases, and open source Apache Cassandra® 2.1 and later databases. A wide variety of options is available to help you tailor the tool to your use case.

In this blog, we'll expand on the documentation we provide for the dsbulk command with eight tips from the DataStax engineering team to help you optimize the bulk data loading process. 

If you haven't installed dsbulk yet, you can set up the tool using the following commands:

curl -OL https://downloads.datastax.com/dsbulk/dsbulk-1.8.0.tar.gz

Then, unpack the downloaded distribution:

tar -xzvf dsbulk-1.8.0.tar.gz
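If you'd like to verify the installation, you can print the tool's version (this assumes the archive unpacked into a dsbulk-1.8.0 directory in your current location):

dsbulk-1.8.0/bin/dsbulk --version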

To learn more about dsbulk setup, take a look at our documentation.

Tip #1: Run the DSBulk Loader on a virtual machine

While running your migration, we recommend using a virtual machine (VM) in the same region as your database to decrease latency and increase throughput (number of rows you can load per second).

DSBulk can be easily installed on a VM using the installation commands above. We strongly recommend using a virtual machine instead of running DSBulk directly on your laptop.

Tip #2: Load data directly from AWS S3 or Google Cloud Storage

For data that doesn't fit on a single machine's hard drive, or even just to leverage the convenience of cloud object storage, dsbulk can load large amounts of data directly from AWS S3 or Cloud Storage on Google Cloud Platform (GCP). 

Load a single CSV file hosted on GCP by passing dsbulk a file URL:

dsbulk load -url https://storage.googleapis.com/bucket/filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret

Load multiple CSVs hosted on GCP by passing dsbulk a file that contains the list of their URLs:

dsbulk load --connector.csv.urlfile https://storage.googleapis.com/bucket/files.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret
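The same approach works for AWS S3. dsbulk reads from any HTTP(S) URL, so one option (a sketch, assuming you use the AWS CLI to generate a temporary presigned URL for a private object; the bucket and file names are placeholders) is:

aws s3 presign s3://bucket/filename.csv --expires-in 3600

dsbulk load -url "<presigned-url>" -k ks -t table -b ~/scb.zip -u client_id -p client_secret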

Tip #3: The DSBulk Loader works well with Astra DB

To connect to Astra DB you need a Secure Connect Bundle (SCB) and an application token. You can download the secure connect bundle and obtain your application token from the DataStax Astra DB web console.

dsbulk works with Astra DB out of the box: pass your SCB to the -b flag, your client ID to the -u flag, and your client secret to the -p flag.
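For example, a minimal load from a local CSV into an Astra DB table looks like this (the keyspace, table, bundle path, and credentials are placeholders):

dsbulk load -url data.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret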

Tip #4: Dealing with rate limits

Astra DB's default rate limit is 4,096 ops/second. Once you've hit the limit, you'll get the following message from the server: “rate limit reached”.

The message appears because Astra DB caps the throughput of free databases. If you want more throughput, upgrade to a pay-as-you-go Astra DB plan.
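In the meantime, one way to avoid hitting the cap (a sketch using the throughput flag discussed in Tip #6; the 4,000 rows/second target is an example value just below the free-tier limit) is to throttle dsbulk:

dsbulk load -url filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --dsbulk.executor.maxPerSecond 4000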

Tip #5: DSBulk tool pooling options

Astra DB works better with more client connections. When running dsbulk, set the number of connections to 16 in the Java driver by adding the following flag to your DSBulk command:

--driver.advanced.connection.pool.local.size 16

Tip #6: Tuning DSBulk

Performance tuning is about understanding the bottlenecks in a system and removing them to improve performance. What is performance? In the case of bulk loading we optimize for throughput (as opposed to latency) because the goal is to get as much data into the system as fast as possible. This is different from a traditional Cassandra operational environment where we might optimize for query latencies.

For a deeper dive into the relationship between latency and throughput (under concurrency) take a moment to review Little's Law.

In practice, as we try to push data faster with DSBulk (the client), we may see latencies increase on Astra DB (the server). If we don't, that's a sign that we still have plenty of database capacity and that we can continue to increase the rate in DSBulk. If, on the other hand, your latencies are increasing without an increase in throughput, you may have to wait for your database to autoscale or open a support request to get better performance.

DSBulk throughput can be controlled with a few different flags:

  1. --maxConcurrentQueries

  2. --dsbulk.executor.maxPerSecond

  3. --dsbulk.executor.maxInFlight

All three of these flags control the same thing (target client throughput); they just do so by three different means, so remember to pick only ONE. The documentation recommends tuning maxConcurrentQueries because it is technically the most efficient. However, we find that maxPerSecond is easier for users to understand, so we recommend it for almost all scenarios.

To keep a closer eye on the client-side latencies, use the -report-rate flag. You can also watch the database-side latencies in your Astra DB Health tab.
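Putting it together, a throttled load that reports metrics every ten seconds might look like the following (a sketch; the 5,000 rows/second target is an arbitrary starting point to adjust as you watch the latencies):

dsbulk load -url filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --dsbulk.executor.maxPerSecond 5000 --report-rate 10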

Tip #7: Handling Errors

If your bulk load is pushing the system to its limits, you may want to configure error and retry thresholds so that your job doesn't simply stop when it hits too many errors. Note that DSBulk logs any failed inserts in the logs directory, and you can re-process any missed queries in a subsequent run.

Set the maximum number of errors tolerated before the process stops with --dsbulk.log.maxErrors, and the number of times a failed query is retried before it is counted as an error with --driver.advanced.retry-policy.max-retries.
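For example, the following command tolerates up to 100 failed records and retries each failed query up to three times before giving up on it (the thresholds are illustrative):

dsbulk load -url filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --dsbulk.log.maxErrors 100 --driver.advanced.retry-policy.max-retries=3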

Tip #8: Onboarding engineers 

Need additional help with your data load? No problem. We've got a team of engineers working around the clock, five days a week. Click the chat icon in the bottom right corner of the Astra portal to start a chat and get immediate help from an engineer. All you've got to do is let them know how much data you have and your deadline for uploading it.

The Final Command

Here's what your command might look like with all the options set:

dsbulk load -url https://storage.googleapis.com/bucket/filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --driver.advanced.connection.pool.local.size 16 --dsbulk.executor.maxPerSecond 10000 --dsbulk.log.maxErrors 100 --driver.advanced.retry-policy.max-retries=3 --report-rate 10

Conclusion

Loading very large datasets into Astra DB can be a breeze if you follow the best practices in this article. DataStax Astra DB is a serverless, multi-cloud-native DBaaS built on Apache Cassandra®. Astra DB's guardrails and limits help enforce best practices by setting parameters for operations and ensuring the database operates with consistent performance.

To learn even more about Astra DB, we recommend watching the Astra DB videos on the DataStax Developers YouTube channel. Our developers introduce you to the most frequently used Astra DB features.

We hope you find these tips helpful, and we hope your experience using Astra DB is fruitful and rewarding.

If you prefer to learn about DSBulk via video, check out this quick overview from Steven Smith.

Need additional help loading your data into Cassandra or Astra? Reach out to us at hello@datastax.com.

Resources 

  1. Astra DB

  2. DataStax Bulk Loader

  3. Apache Cassandra®

  4. AWS S3

  5. Google Cloud Storage 

  6. DataStax Community Platform

  7. DataStax Academy

  8. DataStax Certifications

  9. DataStax Workshops
