Technology | October 3, 2024

Apache Cassandra 5.0 and DataStax: The Benefits of Staying in Sync

As an Apache Cassandra® committer and long-time advocate, I'd like to walk you through the relationship between the open-source Cassandra project and DataStax. With the recent release of Cassandra 5.0, it's the perfect time to explore how this collaboration drives innovation and benefits the entire community, while also examining the challenges faced by those who diverge from the main project.

There can be a lot of confusion about which implementation is compatible with which, and about how to navigate versions. The Cassandra project isn’t perfect, but I’m really proud of how we’ve managed the commercial/OSS problem in an age of open-source projects switching licenses and leaving users feeling like the rug has been pulled out from under them. Hundreds of engineers work on Cassandra, and very few share the same employer. There are notable concentrations at places like DataStax, Apple, Netflix, and NetApp. We’ve built a way of sharing and interacting while moving forward for the benefit of everyone.

The Cassandra Enhancement Proposal (CEP) process

One of the most important things that happened to the project was introducing the Cassandra Enhancement Proposal (CEP) process, which is at the heart of Cassandra's development. This structured approach ensures that large contributions are thoroughly vetted before inclusion in the project. Let's examine how this process works using a real-world example from Cassandra 5.0: Storage Attached Indexes (SAI).

SAI began as an internal project at DataStax, aimed at improving upon secondary indexes. After extensive testing and refinement in Astra DB, our Cassandra-as-a-service offering, we proposed SAI as a feature for Cassandra through the CEP process in August 2020. The journey from proposal to inclusion in Cassandra 5.0 took approximately four years, encompassing discussion, debate, merging, and rigorous testing.

This timeline might seem long, but it demonstrates the careful balance between rapid innovation and long-term stability that the Cassandra project maintains. Over the past five years, we've seen this pattern repeated with other major features, ensuring that Cassandra remains at the forefront of distributed database technology while maintaining its rock-solid stability.

DataStax's contributions to Cassandra 5.0

While SAI is a standout feature in Cassandra 5.0, it's not the only significant contribution from DataStax. We're proud to have also contributed trie indexes, the unified compaction strategy, and vector search, three other major features that enhance Cassandra's capabilities and performance. Each went through the same proposal and vetting process.

These contributions underscore our commitment to pushing the boundaries of what's possible with distributed databases while ensuring that these innovations become part of the open-source ecosystem. Every major contributor to Cassandra follows this pattern.

Another point worth mentioning is that we at DataStax share a commonality with all Cassandra users. We run a lot of Cassandra here—thousands of clusters with thousands of users. We are highly incentivized to find better ways to run Cassandra in a performant, efficient manner. Just like you, we like to sleep at night and hate creating root cause analysis docs. The majority of contributions you don’t see on the “Big Feature” list are directly related to keeping our operations team very bored and well-rested. 

The power of collaboration (and the pitfalls of divergence)

You might wonder why companies like DataStax, Netflix, NetApp, and Apple actively participate in this model and contribute their changes upstream. This isn’t normal, is it? The answer lies in the collective benefits we all reap from each other's work. By staying closely aligned with the upstream version of Cassandra, we avoid the pitfalls of diverging codebases, which can lead to increased maintenance burdens and compatibility issues down the line.

However, not all Cassandra-compatible solutions follow this approach, and the consequences can be significant. Take Amazon Keyspaces and ScyllaDB, for example. These systems have maintained compatibility with Cassandra 3, but haven't kept pace with the changes introduced in versions 4 and 5. As a result, they now face a daunting task: catching up with years of development and innovation in the Cassandra ecosystem.

This divergence creates several challenges:

  1. Feature gap - Users of these systems miss out on the latest capabilities and performance improvements available in newer Cassandra versions.
  2. Compatibility issues - As applications evolve to leverage new Cassandra features, they become less portable to these divergent systems.
  3. Technical debt - The longer these systems remain out of sync with upstream Cassandra, the more difficult and resource-intensive it becomes to incorporate new features and improvements.
  4. Community isolation - Divergent systems might find themselves excluded from the broader Cassandra community's collective knowledge and problem-solving efforts.

The scale of the Cassandra project, with its numerous contributors and rapid pace of development, makes it incredibly challenging for a single company to keep up with all the changes independently. This reality underscores the value of staying aligned with the upstream project and actively contributing to its development.

Astra DB: A cutting-edge view of Cassandra

While the Cassandra project moves at a conservative, measured pace to ensure stability, Astra DB serves as a proving ground for new features and optimizations. Our cloud-native platform enables us to rapidly iterate on ideas and gather real-world feedback from thousands of databases and users.

The agility of Astra DB provides several key advantages:

  1. Rapid response to customer requirements - When new needs arise, such as the sudden demand for vector search capabilities to support generative AI applications, we can quickly develop and deploy solutions.
  2. Real-world testing - We can make new versions available to smaller subsets of users for testing, enabling us to gather valuable insights and refine features before wider release.
  3. Enhanced observability - Our extensive monitoring capabilities provide deep insights into how even small changes affect real-world usage patterns.
  4. User engagement in the CEP process - As users try out new features in Astra DB, they gain practical experience that allows them to provide more informed feedback and ideas during the Cassandra Enhancement Proposal process.

This approach contrasts with the Cassandra project, which relies more heavily on its comprehensive testing suite to harden releases for general use. Once the project releases code, it depends on end-users to produce bug reports, resulting in a longer feedback loop. This difference in processes incentivizes the Cassandra project to produce highly stable code, while Astra can push the boundaries of innovation.

Maintaining compatibility

You might wonder how DataStax ensures that Astra remains compatible with open-source Cassandra while trailblazing ahead with new features. The key lies in our centralized core database team, which is tasked with maintaining compatibility and merging changes from other contributors downstream. You’ll probably see these people active on the Cassandra mailing list and Slack. They need to keep one foot inside DataStax and one in the Cassandra project.

While it can be challenging when there's a major upstream contribution, our established processes and dedicated team make the downstream integration smoother. The guiding principle is to keep up with upstream changes and minimize divergence, ensuring that innovations developed in Astra can be efficiently contributed back to the main Cassandra project. 

The best of both worlds

For users, this symbiotic relationship between the Cassandra project and DataStax offers significant advantages. If you're looking to stay on the cutting edge of distributed database technology, Astra DB provides early access to new features and optimizations. At the same time, you can rest assured that you're not diverging from the open-source path, as these innovations make their way back into the main Cassandra project.

Moreover, Astra users benefit from continuous improvements without the need for manual upgrades. As new features are developed and refined, they're seamlessly integrated into the service, ensuring you're always running on the most advanced version of the technology. That steady pace assures you that you are using the latest technology while avoiding the walled garden so common in other cloud databases.

Home sweet home

The release of Cassandra 5.0 is a testament to the power of open-source collaboration and the importance of staying aligned with the upstream project. By fostering a close relationship between commercial entities like DataStax and the broader Cassandra community, we're able to drive innovation at a rapid clip while maintaining the stability and reliability that users expect from a mission-critical database. Astra DB is proof this is possible. 

The challenges faced by divergent systems like Amazon Keyspaces and ScyllaDB serve as a cautionary tale, highlighting the long-term benefits of remaining in sync with the main Cassandra project. Whether you choose to run Cassandra 5.0, modernize with cloud-native options such as Astra DB, or use a self-managed enterprise solution in Hyper-Converged Database (HCD), you're benefiting from this collaborative ecosystem and ensuring your database solution remains at the forefront of distributed database technology.

It's an exciting time for Cassandra users, and we at DataStax are proud to play our part in shaping the future of this incredible technology while demonstrating the value of staying closely aligned with the open-source community. If you have a few minutes and want to see what Cassandra 5 looks like, you can do it for free by creating an account and clicking the new database button.

As a Cassandra user, it might feel different at first, but I assure you, this is home sweet home. 
