Apache Cassandra vs. the Cloud Databases with Jonathan Ellis

DataStax CTO Jonathan Ellis compares the tradeoffs, strengths, and weaknesses of Apache Cassandra vs. Amazon’s DynamoDB, Microsoft’s Azure Cosmos DB, and Google’s Cloud Spanner.

Highlights!

0:15 - How we managed to get Jonathan on the show

0:45 - Apache Cassandra vs. the cloud databases

1:22 - Three things the industry (mostly) agrees on - 1) eventual consistency is the way to go

3:33 - 2) NoSQL is a bad term - everyone is using SQL variants now

4:28 - 3) Making users understand your partitioning model is the only way to ensure good performance - VoltDB is a case in point. There are data models where there is no good way to auto-partition in a way that is optimized for all queries

7:18 - How Cassandra answers the question of great performance at scale - 1) partition key plus clustering key has been in C* from the beginning

9:45 - 2) More recently, the addition of collections and user-defined types allows more complex structures to be nested within a partition, and these structures can even be represented in JSON on writes and reads

11:00 - More about why Cassandra is awesome - 1) the requirement for a strict schema becomes important when you have more than one client accessing the data.

14:00 - 2) letting the server do denormalization for you with materialized views. Jonathan shares his opinion on how MVs can be used safely in production

16:45 - DynamoDB - the data model: each partition is a sorted map of items. No support for nested data types, still requires application side joins.

18:34 - Defining Local vs. Global indexes. With local indexes, adding nodes to the cluster doesn’t improve performance.

20:10 - DynamoDB’s global indexes are similar to Cassandra’s materialized views, but MVs also allow projection (filtering data and selecting a subset of columns).

21:13 - Why Cosmos DB is the database Jonathan admires most (aside from Cassandra) - Cosmos tried to learn from Cassandra’s mistakes such as having too many consistency levels - why do we have ALL or non-local levels?

23:02 - Jeff makes his pitch for having a “session” consistency level in Cassandra

23:21 - The unexpected multi-data center data access of quorum consistency level in Cassandra

23:49 - Cosmos DB limits you to 5 read consistency levels, which is still probably too many

25:02 - Cosmos DB automatically creates local indexes for all tables, which is a great feature. We don’t know as much about the implementation as we do for other systems like DynamoDB, but we do know the indexes are probabilistic.

28:00 - Having local indexes for all columns in Cosmos DB makes it possible to allow order by operations on every column, which is a powerful feature. What’s missing is support for denormalization via global indexes or materialized views.

29:00 - Cosmos DB and Cassandra are at rough feature parity, although Cassandra is stronger on materialized views. Integrating SASI indexes with the query path may allow Cassandra to catch up on indexing.

29:48 - What confuses Jonathan about Cosmos DB - atomic operations

30:52 - Google Spanner is a bit different than the other databases we’ve talked about - it has a different architecture with the separation of compute and storage (which relies on Google’s distributed filesystem)

32:45 - Google Spanner has no concept of multiple rows per partition. The partition is the row.

33:23 - Google Spanner accomplishes performant joins via interleaved tables

34:08 - Full ACID transactions leveraging TrueTime and multiple rounds of the Paxos consensus algorithm - an elegant implementation of the wrong design?

35:14 - Jonathan analyses some published performance tests on writes in Google Spanner - the reduced write footprint is due to ACID semantics on all writes and file splitting.

38:13 - Jonathan’s recommendations: don’t use DynamoDB due to data model limitations, and don’t use Spanner due to write performance characteristics. His shortlist is Cosmos DB vs. Cassandra on features, and the final advantage goes to Cassandra since you can run it in any cloud.

40:31 - Jeff asks for audience questions that we can put to Jonathan in a future episode
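
For readers who want a concrete picture of the "partition key plus clustering key" model from the 7:18 and 16:45 highlights, here is a minimal Python sketch. It is a toy, in-memory illustration only, not how Cassandra or DynamoDB is implemented. The class name `PartitionedTable`, its methods, and the sample sensor data are all hypothetical names chosen for this example.

```python
from bisect import insort
from collections import defaultdict


class PartitionedTable:
    """Toy model: each partition key maps to rows kept sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering_key, value), always kept sorted
        self._partitions = defaultdict(list)

    def write(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        # Upsert: drop any existing row with the same clustering key first.
        rows[:] = [row for row in rows if row[0] != clustering_key]
        insort(rows, (clustering_key, value))

    def read_range(self, partition_key, start, end):
        """Return rows in one partition with start <= clustering key <= end."""
        rows = self._partitions.get(partition_key, [])
        return [(ck, v) for ck, v in rows if start <= ck <= end]


table = PartitionedTable()
table.write("sensor-1", 3, "c")
table.write("sensor-1", 1, "a")
table.write("sensor-1", 2, "b")
table.write("sensor-2", 1, "x")

# A range query touches a single partition and comes back in clustering order.
print(table.read_range("sensor-1", 1, 2))
```

The point of the sketch is that a query naming the partition key only has to look at one sorted list, which is why the speakers stress that users need to understand the partitioning model to get good performance.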