Company | November 9, 2021

Shatter Data Silos with DataStax Change Data Capture for Apache Cassandra

Swetha Polamreddy

Apache Cassandra®, the open-source, NoSQL distributed database trusted by thousands of enterprises for its unparalleled performance, just got better. Today we are excited to announce a new change data capture (CDC) solution to better connect the real-time data that powers businesses. Cassandra users, including those using DataStax Enterprise, can now capture database changes as a stream of events and expose them via Apache Pulsar™ to any downstream systems and clients.

DataStax CDC for Cassandra powered by Apache Pulsar enables you to derive substantially more value from your Cassandra data stores. The existing CDC feature in Cassandra tracks changes at the table level, and because a table is usually replicated to multiple nodes in the cluster, it may generate duplicate change events for a single record insert, update, or delete. This leads to problems downstream, such as data duplication and the need for conflict resolution and reconciliation. DataStax CDC for Cassandra, by contrast, deduplicates changes observed at the Cassandra replica level and exposes a “clean” set of changed rows (each including the full, most recent data in the row) on a Pulsar topic. From there, Pulsar pipes those changes to wherever they are needed in the enterprise ecosystem.
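To make the deduplication idea concrete, here is a minimal Python sketch. It is not the DataStax implementation: the `ReplicaEvent` shape and the choice of (table, primary key, write time) as the identity of a change are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Hashable, Iterable, Iterator


@dataclass(frozen=True)
class ReplicaEvent:
    """Hypothetical change event as observed on a single replica."""
    table: str
    primary_key: tuple
    write_time_micros: int
    replica: str


def deduplicate(events: Iterable[ReplicaEvent]) -> Iterator[ReplicaEvent]:
    """Yield only the first event seen for each logical change."""
    seen: set[Hashable] = set()
    for event in events:
        # Assumed identity of a change: same table, key, and write time.
        identity = (event.table, event.primary_key, event.write_time_micros)
        if identity in seen:
            continue  # another replica already reported this change
        seen.add(identity)
        yield event


# One insert replicated to three nodes, followed by a later update.
events = [
    ReplicaEvent("users", (42,), 1000, "node-a"),
    ReplicaEvent("users", (42,), 1000, "node-b"),
    ReplicaEvent("users", (42,), 1000, "node-c"),
    ReplicaEvent("users", (42,), 2000, "node-b"),
]
unique = list(deduplicate(events))
print(len(events), "replica events ->", len(unique), "change events")
```

With a replication factor of three, the single insert is observed three times but reaches downstream consumers once.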

Pulsar provides a comprehensive platform for building real-time data applications: it can ingest millions of messages with very low latency, and it can process and stream high-volume, fast-moving data flows in a distributed environment. Pulsar’s multi-layered architecture also decouples compute from storage, allowing for elastic scalability and support for unbounded message retention on low-cost storage. This retention gives CDC for Cassandra the ability to withstand localized network interruptions for an indeterminate amount of time and resume replication once connectivity is restored. From strengthening data recovery solutions to easily building real-time data pipelines, CDC for Cassandra can be of use in many ways.

Better real-time data pipelines with CDC for Cassandra

Enterprises need to connect data across their organizations so they can make data-driven decisions in real time or near real time, provide the best service to customers, and stay one step ahead of the competition. Capturing data changes as they happen across different data sources, then streaming, processing, and publishing those changes to a target destination within the same system, provides a single source of truth. This reduces the surface area for conflicts in security management, schema enforcement, and fault tolerance models.

While Cassandra is a critical part of your data strategy, you also need to ensure that you can easily move data in and out with the ability to integrate with all the other systems and data stores in your organization. Adding CDC for Cassandra allows you to develop real-time data pipelines by leveraging the wide array of connectors and client libraries in Pulsar.

[Image: CDC for Cassandra]

Change data capture for Cassandra in action

Below are some examples of how enterprises can leverage real-time data pipelines enabled by CDC for Cassandra.

Enable search integration

Cassandra is a highly scalable database known for its incredible read and write performance. However, sometimes your use case may require moving data from Cassandra into a purpose-built search solution such as Elasticsearch. In these situations, CDC for Cassandra can simplify the process and automatically update your search indexes in real time.
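As a rough sketch of what keeping an index in step with change events involves, the Python below applies events to an in-memory dictionary that stands in for a search index. The event shape is an assumption for illustration: a document key plus either the full current row or `None` to signal a delete. A real pipeline would normally use a Pulsar sink connector instead of hand-written code.

```python
from typing import Optional

# Hypothetical in-memory stand-in for a search index, keyed by document id.
search_index: dict[str, dict] = {}


def apply_change(key: str, row: Optional[dict]) -> None:
    """Apply one change event: a row upserts the document, None deletes it."""
    if row is None:
        search_index.pop(key, None)  # deleting a missing key is a no-op
    else:
        search_index[key] = row  # the full row replaces any older version


apply_change("user:42", {"name": "Ada", "city": "London"})
apply_change("user:42", {"name": "Ada", "city": "Paris"})  # update
apply_change("user:7", {"name": "Grace", "city": "New York"})
apply_change("user:7", None)  # delete
print(search_index)
```

Because each event carries the whole row, the update is a plain overwrite and no read-modify-write against the index is needed.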

Analytics-ready data

Traditional business intelligence has long relied on batch processing to move data from operational data stores into data warehouses for reporting and analytical purposes. The lag caused by this batch processing is increasingly at odds with the need for real-time, up-to-the-moment information for business leaders who need to make snap decisions to remain competitive in today’s digital world. With CDC for Cassandra, delayed batch jobs can be replaced by immediate updates that automatically stream data changes from Cassandra into your data warehouse solution, providing an accurate picture that always reflects the current state of the data. Enterprises can now both simplify and modernize their data architecture, leaving behind the inefficiencies of batch processing.

Operational ML 

Data science often involves the analysis of time series data, which is not always easy to capture. With CDC for Cassandra, data scientists can more easily access an event stream of time series data that represents the changes that have happened on a table-by-table basis. These time series play a critical role in training ML models, which can be used to extract greater insights and predictive capabilities. While these models are valuable on their own, you can also operationalize them as part of your data-in-motion strategy: capabilities such as Pulsar Functions let you apply the models to enrich data in real time within your streaming data pipelines.
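The enrichment step might look something like the sketch below, written in the style of a native Python Pulsar Function (a module exposing a `process(input)` function). The JSON event layout, the `amount` field, and the `score` function standing in for a trained model are all hypothetical.

```python
import json


def score(amount: float) -> float:
    """Hypothetical stand-in for a trained model: a score between 0.0 and 1.0."""
    return min(amount / 1000.0, 1.0)


def process(input):
    """Enrich one change event, received as a JSON string, with a model score."""
    event = json.loads(input)
    event["risk_score"] = score(float(event["amount"]))
    return json.dumps(event)


# Local check, outside any Pulsar runtime.
enriched = json.loads(process('{"id": 1, "amount": 250}'))
print(enriched)
```

Keeping the model call inside a small, stateless function means the same code can be exercised locally, as above, before it is deployed to a streaming pipeline.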

In today's digitized world, data is the lifeblood of any enterprise, and the free flow of data is critical. As enterprises grow, they tend to add new technologies or services that, more often than not, lead to multiple systems that are not tied together in a meaningful way. This fragmented architecture is expensive to maintain, difficult to improve, and almost always creates data silos. To resolve this, more and more enterprises are moving towards unified event-driven architectures. DataStax CDC for Cassandra powered by Pulsar was created specifically to help solve these issues. It will help you simplify, improve, and modernize data architectures while eliminating data silos.

Version Compatibility

CDC for Cassandra is compatible with DataStax Enterprise version 6.8.16 and with open-source Cassandra versions 4.0.x and 3.11.x.

Ready to build your own event-driven data architecture?

DataStax CDC for Apache Cassandra brings the expertise of DataStax engineers to Cassandra and DataStax Enterprise (DSE) users, in the form of enterprise assistance for enabling CDC use cases. It provides mission-critical support for the following software components:

  1. DataStax Cassandra Source Connector for Apache Pulsar, a Pulsar IO source connector
  2. DataStax Change Agent for Apache Cassandra, which works with Apache Pulsar, including DataStax Luna Streaming. Luna Streaming provides professional support from highly skilled DataStax engineers who are experts at operating distributed Apache Pulsar clusters at scale.


You can also contact us today to schedule a demo and an architectural discussion.  
