Company | January 27, 2021

Data, Data Everywhere: Bringing Together the High Performance Stack for Distributed Data

At DataStax, we spend our time helping developers and enterprises everywhere use the powerful, open-source Apache Cassandra™ database to build modern data apps with global scale, zero downtime, and zero lock-in.  

In 2020, we focused on making Cassandra easier to use so that millions of developers and enterprises could benefit from its scale and performance and derive more value from data. 

Now, we’re excited to be expanding our offerings to include another part of the stack that is becoming increasingly critical for data at scale: streaming.

Streaming, scale and distributed computing

Organizations everywhere are increasingly adopting streaming platforms to collect and correlate massively distributed data at high speeds across expanding, distributed ecosystems. The total market size for Cloud Event Stream Processing Software in 2024 is forecast to be $8.5 billion, according to IDC [1]. These platforms drive real-time analytics and many data science and machine learning initiatives across the enterprise. They can enable differentiated user experiences like real-time order tracking, user notifications, and recommendations. Because streaming features often serve users and systems that are geographically dispersed, it’s critical that streaming capabilities provide performance, replication, and resiliency across disparate geographies.

If you work with Apache Cassandra, these requirements should look familiar, because they’re the same things you need Cassandra for. And as developers look to do more with event-based and streaming architectures, we’re seeing strong interest in bringing together the power and scale of Cassandra with distributed, high-scale streaming technology.

To address this, early in 2020 we began looking for a streaming technology for the data stack that was both open and, more importantly, able to support multi-cloud. It would have to support a distributed model in the same way as Apache Cassandra, decoupling message broker nodes from specific partitions and making it easier to scale by adding nodes rather than through time-consuming and expensive topic repartitioning.

Apache Pulsar: cloud-native streaming at scale

While a lot of the hype in streaming has been around Apache Kafka, we've seen that top architects and developers, especially those dealing with cloud-native production use cases on Kubernetes, have been talking about Apache Pulsar. Pulsar was built at Yahoo! as a streaming platform that would run fully distributed, so it met our criteria. We rolled up our sleeves and started working with it ourselves. We built key parts of our DataStax Astra cloud platform on top of Pulsar and got involved in the growing Apache Pulsar community.

Along the way, we became very impressed with what Chris Bartholomew and his team at Kesque were doing to make Pulsar easy to use, and we are excited to have them join forces with us at DataStax. Our CTO Jonathan Ellis has written a great blog post on the benefits of Pulsar, and Chris has written a post on his team's journey that I hope you will read.

We know from experience that effectively managing data at global scale requires a robust platform. Top video streaming platforms deal with over a trillion events each day as users search, watch, pause and play videos. Retailers must manage changes to inventories from thousands of stores across the globe in real time to adapt to the realities of today’s hyper-efficient modern supply chain. These are simply not possible without a platform like Pulsar that has been purpose-built to address these distributed computing challenges at scale.

All this is to say that we are extremely proud to now offer DataStax Luna Streaming, a production-ready, open-source distribution and support subscription for Apache Pulsar. Combined with our Apache Cassandra offerings, it lets developers build a powerful unified data platform with seamless access to both data at rest and data in motion.

Last year, we brought Cassandra to Kubernetes and to the cloud, making it dead simple to operate for any developer, and then we introduced Stargate to make building apps on top of Cassandra from any language trivial. Cassandra is now the world’s most scalable database, and it’s also the easiest to use.  But we've always felt that DataStax needed to live up to its name, to truly be able to support the open multi-cloud data stack. 

Streaming is the next major piece of that stack, and I'm very excited about this new chapter in the DataStax journey.


[1] IDC Semiannual Software Tracker, November 12, 2020

 
