A Decade of Apache Cassandra® Data Modeling

Data modeling has been a challenge with Apache Cassandra® for as long as the project has been around. In the earlier days of the project, many such questions were asked in the Cassandra tag on Stack Overflow. As someone who spent a lot of time answering questions in that tag (1,175, to be exact), I’ve seen my share of questions about data modeling. In fact, I answered so many questions about Cassandra data modeling that I was inspired to write about one particular aspect of it: result set ordering.

This culminated in a March 2015 blog post titled, “We Shall Have Order!,” that DataStax was kind enough to publish several years before I actually worked here.

The problem posed by the original “We Shall Have Order!”

The subject of the original post was created from an amalgamation of questions that I had answered about querying and sorting. The Cassandra cluster in question was serving a message board application, and they had to be able to see posts by user and by time. As was the conventional Cassandra wisdom at the time, the idea was to show how to solve this problem using “query tables,” or rather, tables designed to serve a specific query.

Assuming that we have an original Cassandra table named “posts,” that looks like this:

CREATE TABLE posts (
  posttime timestamp,
  postcontent text,
  postid uuid,
  userid bigint,
  PRIMARY KEY (posttime));

To solve for querying posts for a specific user sorted by time, we built a query table:

CREATE TABLE posts_by_user_by_year (
  userid bigint,
  posttime timestamp,
  postid uuid,
  postcontent text,
  year bigint,
  PRIMARY KEY ((userid,year),posttime)
) WITH CLUSTERING ORDER BY (posttime DESC);

This solution worked for several reasons:

Posts could be queried by userid (and year).
The result sets would leverage the on-disk ordering and be sorted by posttime, descending.
The partition was “bucketed” by year, to keep the partition from growing too big.

But if a developer wanted to query that data in a different way, they’d have to create a new query table. An inconvenience, yes. But for the time, this worked really well. In fact, there are many high-performing (and high-profile) applications today that are built on these ideas, which evolved out of similar distributed data modeling techniques.

Modern solutions with modern indexing

So how would we solve this problem today? Certainly, after a decade’s worth of time, we have tools and functions at our disposal that can help us to better solve this problem from a developer’s perspective.

The way that this would be solved in the relational world: we would have one, single table. Any column that we needed to query by, would have an index added to it. Can we do that in Cassandra?

While index efficiency has been an issue with Cassandra for years, we now have storage attached indexes (SAI) in Cassandra 5.0 (CEP-7). SAI indexes are engineered to prevent many of the scaling issues that hindered the previous indexing strategies.

So how could SAI help us here?

Let’s assume we have the original table, but with postid as the PRIMARY KEY:

CREATE TABLE stackoverflow.posts_orig (
    postid uuid PRIMARY KEY,
    postcontent text,
    posttime timestamp,
    userid bigint);

Now, let’s create two indexes:

CREATE CUSTOM INDEX ON stackoverflow.posts_orig(userid)   USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ON stackoverflow.posts_orig(posttime) USING 'StorageAttachedIndex';

The first index allows us to query by userid:

SELECT postcontent, posttime FROM posts_orig WHERE userid=2;

 postcontent                                            | posttime
--------------------------------------------------------+---------------------------------
                         Is your license in order, pal? | 2015-01-25 19:04:00.000000+0000
                           B26354, filed and monitored. | 2015-01-25 19:22:00.000000+0000
 I don't get it. Why do they risk coming back to Earth? | 2015-01-25 19:03:00.000000+0000

(3 rows)

Can the second index enable us to enforce a sort order? While Cassandra can’t just yet, the latest version of SAI in Astra DB will let us do that. Here, we’ll order our result set by posttime:

SELECT postcontent, posttime FROM posts_orig ORDER BY posttime DESC LIMIT 5;

 postcontent                                            | posttime
--------------------------------------------------------+---------------------------------
                             Game over, man! Game over! | 2015-01-25 19:25:00.000000+0000
                           B26354, filed and monitored. | 2015-01-25 19:22:00.000000+0000
                                    I aim to misbehave. | 2015-01-25 19:21:00.000000+0000
                         Is your license in order, pal? | 2015-01-25 19:04:00.000000+0000
 I don't get it. Why do they risk coming back to Earth? | 2015-01-25 19:03:00.000000+0000

(5 rows)

As we can see, SAI in Astra DB not only enables rows to be queried by non-key columns, it also allows us to enforce a sort order by indexed columns. This is a vast improvement over the previously available indexing paradigms.

NOTE: Currently in Astra DB, enforcing a sort order with an SAI-indexed column requires a LIMIT clause with a value of less than 1000.

Solving with the new Data API

And what about DataStax’s new Data API? Could we accomplish this in Astra DB, while working under a document paradigm? First of all, we’d have to change our toolset a little, from cqlsh to something that can work with restful JSON like Postman or even cURL. With that, we could send a POST request (while passing along our Astra API token) to create a new collection in our cluster:

https://cab00884-ea42-4e4e-a426-e4199fb25536-us-east1.apps.astra.datastax.com/api/json/v1/stackoverflow
{ "createCollection": {"name": "posts_collection" }}

With that collection, we can then send another POST request to insert our data:

https://cab00884-ea42-4e4e-a426-e4199fb25536-us-east1.apps.astra.datastax.com/api/json/v1/stackoverflow/posts_collection

{ "insertMany": {
    "documents": [
        {"1", "2015-01-25 13:25:00-0600", "Game over, man! Game over!"},
        {"1", "2015-01-25 13:02:00-0600", "Why don't you put her in-charge?"},
        {"2", "2015-01-25 13:22:00-0600", "B26354, filed & monitored."},
        {"2", "2015-01-25 13:03:00-0600", "I don't get it. Why do they risk coming back to Earth?"},
        {"2", "2015-01-25 13:04:00-0600", "Is your license in-order, pal?"},
        {"3", "2015-01-25 13:21:00-0600", "I aim to misbehave."}
    ]
    }
}

With some documents present in the collection, we can run a find operation (via a POST request), like this:

https://cab00884-ea42-4e4e-a426-e4199fb25536-us-east1.apps.astra.datastax.com/api/json/v1/stackoverflow/posts_collection
{"find": 
    {
        "filter": {"userid":"2"},
        "sort" : { "posttime":-1 }
    }
}

Which returns:

 {
   "data": {
       "documents": [
           {
               "_id": "a80cf2b2-8572-4d16-8cf2-b285724d16e3",
               "userid": "2",
               "posttime": " '2015-01-25 13:22:00-0600'",
               "postcontent": " 'B26354, filed & monitored.",
               "__parsed_extra": null
           },
           {
               "_id": "d4f00dc8-6de5-4cc4-b00d-c86de55cc4a5",
               "userid": "2",
               "posttime": " '2015-01-25 13:04:00-0600'",
               "postcontent": " 'Is your license in-order, pal?",
               "__parsed_extra": null
           },
           {
               "_id": "77131a3c-c3d7-4aae-931a-3cc3d7daae17",
               "userid": "2",
               "posttime": " '2015-01-25 13:03:00-0600'",
               "postcontent": " 'I don''t get it. Why do they risk coming back to Earth?'",
               "__parsed_extra": null
           }
       ],
       "nextPageState": null
   }
}

As we can see, the Data API fundamentally changes the ways that we can work with our data. Exposing our Astra DB data with a RESTful API greatly simplifies the development process, and keeps them from having to focus on the database.

NOTE: The Data API will soon be available to Cassandra users!

Conclusion

The last decade has brought many exciting changes and innovations to the Cassandra ecosystem. Innovations such as SAI, the Data API, and vector search have fundamentally changed the way we approach our data model. We can only speculate what the next ten years of Cassandra will bring, but some recent browsing through the Cassandra Enhancement Proposal (CEP) list indicates that more, game-changing improvements aren’t too far away.

This started out as a story of an old blog post, a distributed database, and someone who was just trying to score as much Stack Overflow reputation as possible during the “Great NoSQL Wars” of the early 2010s. But mostly, this story is about looking at where we’ve been, recognizing how we’ve grown, and appreciating where we are today.

“All that is visible must grow beyond itself, and extend into the realm of the invisible.”

Dumont, “Tron” (1982).

Want to use these next-generation data modeling features with your Cassandra database? Have a look at the CEP list and see what’s on the horizon. Or, check out DataStax Astra DB, and try them out today!