Technology | September 5, 2024

Apache Cassandra 5.0 Is Generally Available!

As an Apache Cassandra® committer and long-time advocate, I’m really happy to talk about the release of Cassandra 5.0. This milestone represents not just an upgrade to Cassandra but a big leap in usability and capabilities for the world's most powerful distributed database.

There’s something for everyone. Operators running large clusters and developers building applications will each find something exciting in this new release. It’s estimated that there are over 30,000 organizations using Cassandra, with over 100 million nodes deployed. A release like this has a big impact. 

Because the world relies so much on Cassandra, it’s important to remind everyone that Cassandra 5.0 marks the end-of-life for Cassandra 3.x. Now is the time for organizations to plan their upgrade strategy. Whether you're considering a move to Cassandra 5.0, exploring cloud-native options, or looking at enterprise solutions, the Cassandra ecosystem has a lot of options for you.

The journey to Cassandra 5.0

The road to this release has been both challenging and rewarding. As a community, we've pushed hard to maintain Cassandra's reputation for rock-solid stability while introducing envelope-pushing features. We entered code freeze last November and began in-depth testing during the beta period. Over 100 bugs were found and fixed in that time, some of them complicated regressions that took a lot of work to understand. The project's ethos of "only ship when CI is green" meant a longer wait, but the result is a version that's truly ready for prime time on day one.

Let's examine the key features that make Cassandra 5.0 a transformative release. I’ll provide my personal take on each one, in addition to the features and benefits. 

1. Storage attached indexes (SAI)

SAI has been the most anticipated feature in this release, and for good reason. It revolutionizes query flexibility and performance, especially for large datasets. Cassandra users know that using columns outside the primary key can be restrictive in data modeling. SAI is a closer match to what you might create in a relational database—but at Cassandra scale. Your WHERE clause just got a lot more useful.
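To make the WHERE clause point concrete, here is a minimal sketch of the CQL involved, assembled as plain Python strings so it can be read without a running cluster. The keyspace, table, column, and index names are hypothetical and chosen only for illustration; the `USING 'sai'` clause is the part that selects a storage-attached index in Cassandra 5.0.

```python
# Hypothetical keyspace and table names, used only for illustration.
KEYSPACE = "store"
TABLE = "orders"


def sai_index_cql(keyspace: str, table: str, column: str) -> str:
    """Build a CQL statement that creates a storage-attached index on one column."""
    index_name = f"{table}_{column}_sai"  # naming scheme is an assumption
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {keyspace}.{table} ({column}) USING 'sai';"
    )


# One index per non-primary-key column we want to filter on.
statements = [sai_index_cql(KEYSPACE, TABLE, col) for col in ("status", "total")]

# With both indexes in place, a query can filter on these columns directly.
query = (
    f"SELECT * FROM {KEYSPACE}.{TABLE} "
    "WHERE status = 'shipped' AND total > 100;"
)

for stmt in statements + [query]:
    print(stmt)
```

The printed statements could be pasted into cqlsh or passed to a driver session. Whether a given predicate is supported depends on the column type and index options, so treat the query as an example shape, not a guarantee.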

Patrick’s take: DataStax has been working on SAI for a number of years and has deployed it in our Cassandra-as-a-Service, Astra DB. The results have been nothing short of remarkable. End users love it, and we’ve been tuning and learning how it works at scale, making it very reliable. 

2. Trie memtables and trie SSTables

These may look like random words, but trust me, they represent an incredible advance for any database. Tries are a low-level optimization in how data is stored and retrieved, yielding impressive gains in memory usage and storage efficiency. The best part of the story is that it’s one of those “free” performance boosts. You don’t need to change a data model or use a new operator. Updating the server will get you the performance gains, making Java code more efficient at scale than any C++ code. It’s that good. 
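If the word "trie" is new to you, the toy sketch below shows the core idea: keys that share a prefix share storage. This is a teaching example only, not Cassandra's implementation, and the class names and sample keys are made up for illustration.

```python
class TrieNode:
    """One node per character; keys with a common prefix share nodes."""

    def __init__(self):
        self.children = {}
        self.value = None
        self.terminal = False  # True when a stored key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 0  # number of nodes created, excluding the root

    def put(self, key, value):
        node = self.root
        for ch in key:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.terminal = True
        node.value = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.terminal else None


# Hypothetical keys with a long shared prefix.
keys = ["sensor-0001", "sensor-0002", "sensor-0003"]
trie = Trie()
for i, k in enumerate(keys):
    trie.put(k, i)

total_chars = sum(len(k) for k in keys)
print("characters across all keys:", total_chars)
print("trie nodes actually stored:", trie.node_count)
```

The three keys contain 33 characters in total, but the trie needs only 13 nodes because the shared prefix is stored once. The same intuition, applied to sorted partition and row keys, is why trie-based structures can save memory and disk.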

Patrick’s take: My DataStax colleague Branimir Lambov has been leading the charge on trie indexes for a number of years. He’s been publishing papers and giving talks that developers of other databases are paying attention to, and there isn’t any better form of flattery. 

3. Java development kit (JDK) 17 support

Painfully, Cassandra has been parked on an old version of Java, JDK 8. Moving on has been on operators' wishlists for a very long time. Today’s the day! Cassandra is making the jump to JDK 17, which brings an impressive set of performance gains: up to 20% in some cases. The underlying reason for these gains is how JDK 17 manages memory. Cassandra is a high-volume application with data going in and out at a furious pace. On less powerful systems, this can cause work to back up and trigger the dreaded garbage collection pause in Java. The Java project is on the path to eliminating this issue, and JDK 17 is a step in the right direction. 

Patrick’s take: Java is advancing faster than ever now. In our Astra service, we are already advancing the state of the art by migrating Cassandra workloads to JDK 21 as we measure the effects on the thousands of clusters we manage. The result? Essential learnings that we’ll pass on to the Cassandra project as the pace of change in the JDK increases. If you’re in the camp of “C++ is better because it doesn’t do GC,” then it’s time to update your thinking. 

4. Unified compaction strategy (UCS)

UCS is a love letter to operators everywhere with the simple words “operational efficiency.” Let’s face it: nobody is storing less data. It’s common to start with a 10-node cluster and find yourself at 100 nodes before you know it. Compaction is a Cassandra process that constantly reorganizes stored data to make it as efficient as possible. There are several ways to do it, which we call “strategies,” and each suits a different set of requirements. UCS is like an autopilot that just works; the outcome is a system that evolves and responds as requirements change. The best part? It just happens! So when you find yourself going from 10 to 100 to 1,000 nodes, it will adjust and give operators the best gift of all: time. 
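For readers who haven't looked under the hood, the sketch below shows what any compaction does at its core: merge several sorted runs of data and keep only the newest write for each key. It is a deliberately simplified illustration of compaction in general, not of UCS itself; it ignores tombstones, sharding, and levels, and the row layout and function name are assumptions made for the example.

```python
import heapq


def compact(sorted_runs):
    """Merge sorted runs of (key, timestamp, value) rows into one sorted run,
    keeping only the newest write for each key."""
    # Order by key, then newest timestamp first, so the first row seen
    # for a key is the one that wins.
    merged = heapq.merge(*sorted_runs, key=lambda row: (row[0], -row[1]))
    result = []
    for key, timestamp, value in merged:
        if result and result[-1][0] == key:
            continue  # an older version of a key we already kept
        result.append((key, timestamp, value))
    return result


# Two hypothetical on-disk runs, each already sorted by key.
older_run = [("a", 1, "v1"), ("b", 1, "v1"), ("d", 1, "v1")]
newer_run = [("b", 2, "v2"), ("c", 2, "v1")]

compacted = compact([older_run, newer_run])
print(compacted)
```

After the merge, the stale version of key "b" is gone and reads only need to consult one run instead of two. A compaction strategy decides which runs to merge and when; in Cassandra, the strategy is selected per table through the `compaction` table option, where UCS is identified by the class name `UnifiedCompactionStrategy`.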

Patrick’s take: UCS started as an effort at DataStax when working with some of our largest customers on increasing operational efficiency. There were several iterations, but this is the one that stuck. This is one of the things we ship in DataStax Enterprise (DSE) and Hyper-Converged Database (HCD) that, along with other tools, allows for 10 TB to even 20 TB per node. 

5. Vector search

AI is going through a big renaissance right now with a focus on generative AI. No surprise to anyone here, but it’s yet another data problem that requires the right tools. Vector search is a key part of GenAI techniques like retrieval-augmented generation (RAG), where instead of matching keywords, we use semantics to find data. Cassandra now supports a vector data type and indexing for approximate nearest neighbor (ANN) search. While the vector search capabilities in Cassandra 5.0 are a great start, this is just the beginning.
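To show what "finding data by semantics" means mechanically, here is a small sketch of exact nearest-neighbor search using cosine similarity. It scans every vector, which is the brute-force baseline that an ANN index approximates at much larger scale. The documents and their three-dimensional embeddings are invented for the example; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, items, k):
    """Return the names of the k items whose vectors are most similar to query."""
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical documents with made-up embeddings.
docs = [
    ("cats", [0.9, 0.1, 0.0]),
    ("dogs", [0.8, 0.2, 0.1]),
    ("databases", [0.0, 0.1, 0.9]),
]

top_two = nearest([1.0, 0.0, 0.0], docs, 2)
print(top_two)
```

In Cassandra 5.0 the equivalent request is written in CQL against a `vector` column that has a storage-attached index, using an `ORDER BY ... ANN OF ... LIMIT k` clause, and the index returns approximate results without scanning every row.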

Patrick’s take: At DataStax, we're already working on more advanced vector search capabilities (JVector v2 and v3) that go beyond what's in this release. The leading edge is moving fast in the JVector GitHub repository. We continue to push transaction speeds and volumes in Astra, which will keep the upstream contributions to Cassandra solid for future versions. 

What this means for the Cassandra ecosystem

Cassandra 5.0 is more than just a software update. It reflects the vibrant, collaborative nature of open-source development. DataStax is a contributor to Cassandra, along with many other large operators, which speaks to the ground-up nature of this database. We are all solving problems at ridiculous scales and sharing those solutions with everyone else who uses Cassandra. DataStax runs tens of thousands of clusters. Apple (last reported) has hundreds of thousands of nodes. When we ship things with confidence, it’s not just because a feature works great on my laptop; it’s because we’re running the code we ship before it’s released. This diversity of contribution is what makes Cassandra truly special.

Get involved!

  1. Make some noise on social channels! This is a big deal, and everyone should know about it. Tell your network what you’re excited about and get the word out. Make sure to tag the Cassandra project on X or LinkedIn. 
  2. Share your experience! Join the community discussions and share your insights or help someone out.
  3. Contribute! Whether it's code, documentation, or use cases, your contributions help shape the future of Cassandra.

DataStax is here to help on your Cassandra journey, especially with new GenAI applications. If you want to hand off the operations to somebody else, we have Astra DB for Cassandra workloads. And we’ve got some incredible migration services if you need them. For those of you who want to stick with self-managed options but need more support and enterprise integrations, we have choices there too: HCD for Cassandra workloads running in Kubernetes, and DataStax Enterprise for a more standard software deployment. However you want to run Cassandra, we’re here to help.

 
