Company | March 24, 2022

Change Data Capture (CDC) with Astra DB Now Makes Building Your Own Real-Time Data Pipeline Easy

The only constant is change. Nowhere does this axiom hold more true than when it comes to your data. At every moment of every day, thousands upon thousands of changes are being made throughout your organization in response to customer interactions, supplier and partner inputs, employee actions and so forth. Organizations that want to become truly data-driven are facing increasing demands to not only have access to all of the pieces of data that represent the various aspects of their business, but also to have visibility into the changes those pieces of data undergo. And, they need the ability to take action in real time in response to those changes.

Traditionally, getting access to data changes has been the responsibility of event-driven architectures that relied heavily on individual applications publishing a subset of events onto a message bus. 

As the demand for real-time applications has grown, these traditional approaches have proven incapable of scaling to meet the needs of real-time, data-driven enterprises – both in terms of their ability to scale the middleware and infrastructure and the ability to scale their real-time app development. These approaches are commonly responsible for introducing bottlenecks and adding development overhead as organizations seek to get immediate visibility into a comprehensive set of change events across their business domains.

One solution that has emerged in response to these challenges is change data capture (CDC). With CDC, the responsibility of publishing change events is moved from applications to the single source of truth – the database that serves the application. 

In recent years, CDC has become a popular approach for eliminating many of the bottlenecks and much of the overhead associated with constantly updating applications to generate a complete set of change events. For relational database platforms that operate at a relatively small scale compared to NoSQL platforms like Cassandra, these solutions were a big improvement.

Operating CDC with a horizontally scalable, leaderless, distributed architecture like Apache Cassandra®, however, has proven to be a much harder problem. A change can originate on any one of the nodes in a database cluster and is then propagated to the other cluster members. This makes features that are trivial in relational databases, like sequencing and deduplication, orders of magnitude more complex. Organizations had to decide whether to tackle this complexity head-on or find workarounds, such as batch ETL processes, to detect data changes in their Cassandra database. Either choice meant more overhead.
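To make the sequencing and deduplication problem concrete, here is a minimal Python sketch. It is an illustration only and does not describe how CDC for Astra DB works internally. The `ChangeEvent` fields (`key`, `write_ts`, `digest`, `node`) are hypothetical names chosen for the example; the assumption is that each replica reports the same mutation independently, so a downstream consumer has to collapse the copies and put the changes to each row back in order.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical shape of a change reported by one replica."""
    key: str       # primary key of the changed row
    write_ts: int  # write timestamp of the mutation (microseconds)
    digest: str    # hash of the mutation contents
    node: str      # replica that reported this copy


def dedupe_and_order(events: Iterable[ChangeEvent]) -> List[ChangeEvent]:
    """Keep one copy of each mutation, ordered per row by write timestamp."""
    seen = set()
    unique: List[ChangeEvent] = []
    for ev in events:
        # The same mutation reported by several replicas shares this identity.
        identity = (ev.key, ev.write_ts, ev.digest)
        if identity in seen:
            continue
        seen.add(identity)
        unique.append(ev)
    # Stable sort: group by row key, then order each row's changes by time.
    unique.sort(key=lambda ev: (ev.key, ev.write_ts))
    return unique
```

In a relational database a single transaction log provides this ordering for free; in a leaderless cluster something has to do this work for every change, at full write throughput.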

Now, that difficult decision is a thing of the past. Astra DB has already become the go-to database for organizations that need the scale and pure unbridled performance of Cassandra without any of the operational burden of managing their own NoSQL platform. With the launch of CDC for Astra DB, these same organizations have an equally capable CDC solution that was built to satisfy the performance and throughput requirements of even the most demanding internet-scale use cases.

CDC for Astra DB is powered by Astra Streaming, a multi-cloud streaming-as-a-service platform built on Apache Pulsar. Using a simple configuration-based approach, you can enable CDC on one or more of your Astra DB tables and publish the changes to an event topic in Astra Streaming. From there, your real-time applications can subscribe to change events using client libraries in Java, Golang, Python, or Node.js. Additional endpoints support direct subscription via a WebSocket interface or a standard JMS client.
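As a rough idea of what subscribing from Python might look like, here is a minimal sketch using the open-source Apache Pulsar client (`pip install pulsar-client`). The service URL, token, topic, and subscription name are placeholders you would take from your own Astra Streaming tenant, and the helper names `open_consumer` and `drain_change_events` are invented for this example. This post does not describe the change event schema, so the sketch hands the raw payload bytes to a handler you supply rather than guessing at the format.

```python
from typing import Callable


def open_consumer(service_url: str, token: str, topic: str, subscription: str):
    """Connect to a Pulsar-compatible endpoint and subscribe to one topic."""
    import pulsar  # imported lazily so the rest of the sketch has no dependency

    client = pulsar.Client(
        service_url,
        authentication=pulsar.AuthenticationToken(token),
    )
    consumer = client.subscribe(topic, subscription_name=subscription)
    return client, consumer


def drain_change_events(consumer, handle: Callable[[bytes], None],
                        max_events: int) -> int:
    """Process up to max_events messages; return how many were handled."""
    processed = 0
    for _ in range(max_events):
        try:
            msg = consumer.receive(timeout_millis=5000)
        except Exception:
            # The Pulsar client raises when the timeout expires with no message.
            break
        try:
            handle(msg.data())
        except Exception:
            # Ask the broker to redeliver this event later.
            consumer.negative_acknowledge(msg)
            continue
        consumer.acknowledge(msg)
        processed += 1
    return processed


# Example usage (placeholder values, requires a reachable tenant):
# client, consumer = open_consumer(
#     "pulsar+ssl://<your-broker-host>:6651", "<your-token>",
#     "persistent://<tenant>/<namespace>/<cdc-topic>", "my-subscription")
# drain_change_events(consumer, lambda payload: print(len(payload)), 100)
# client.close()
```

Acknowledging only after the handler succeeds means a crash in your business logic does not silently drop a change event; the broker redelivers it instead.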

If the destination of your CDC data is another platform such as Snowflake, Elasticsearch, Kafka or Redis (to name just a few), Astra Streaming also allows you to create real-time data pipelines through a simple configuration-driven interface using the built-in connector library.

Using Astra CDC, you can accelerate the development of a wide range of use cases important to data-driven organizations including:

  • Data integration: Immediately send updated data throughout your data ecosystem when a piece of data changes in Astra DB.
  • Machine learning: Leverage Astra Streaming’s event persistence capabilities to replay a sequence of changes as inputs into ML models for training and scoring purposes.
  • Real-time applications: Build applications that respond to CDC change events to drive business logic in response to specific changes being detected in your Astra database.
  • Advanced search: Push data from your Astra DB instance into a full text search engine such as Elastic.
  • Notifications: Detect when changes on your Astra database occur and integrate with platforms such as Firebase to send SMS or push notifications.
  • Reporting and analytics: Ensure that business stakeholders are using up-to-date data to make critical decisions that can impact your business.
  • Security monitoring: Gain visibility into anomalous behavior that may indicate a security breach with CDC’s consumable stream of event data.

CDC for Astra DB opens up a new set of possibilities by leveraging the powerful capabilities already in Astra to provide limitless scale for your data at rest, in motion, and everything in between.

You can refer to our product documentation to learn more about CDC for Astra DB, including pricing and currently supported cloud regions. We also offer a quickstart guide to help you get started. The first step, though, is to register for your free Astra DB account. With 80 GB free monthly and Astra Streaming already built in, you can be on your way to building your own streaming data pipeline in minutes!
