April 21, 2020

Get your Definitive Guide for Apache Cassandra 4.0

I first met Jeff Carpenter a few years ago by way of Eben Hewitt, the original author of the O’Reilly Media book “Cassandra: The Definitive Guide.” He was passing the torch to Jeff for a much-needed update to a very outdated book. I think it’s safe to say that the 2nd edition was instrumental in helping expand the Cassandra community. From in-depth operations coverage to building applications and data models, it lived up to its title as “The Definitive Guide.” To avoid repeating the sins of the past, Jeff has been working on the 3rd edition to keep the book relevant as Cassandra continues to evolve. I got a chance to talk to him about the book, some of his motivations, and the updates to expect.

Patrick: The 3rd edition of Cassandra: The Definitive Guide releases on April 21, 2020. Why a new edition?

Jeff: The 2nd edition was created to coincide with Cassandra 3.0, so when the community started moving toward a 4.0 release last year, it seemed like the time was right to start working on an update, especially since it had already been about 3 years since the 2nd edition. 

Patrick: Yes, it has been a while. I remember helping you out with the Apache Spark section last time, and a lot has happened with the project since then.

Jeff: That’s right, that section on Spark integration was literally the last set of pages written for the 2nd edition, and funnily enough it was the last section written for the 3rd edition as well. Of course, it’s all about DataFrames instead of RDDs now, and I had to put my Scala hat back on.

Patrick: That’s one change! Interesting that you decided to focus this edition on 4.0 even though it’s still in Alpha? 

Jeff: Right, but I did feel pretty comfortable about that since Cassandra 4.0 is feature frozen. 

Patrick: Tell me about some of the key highlights of 4.0 that you cover. Let me guess, transient replication, virtual tables, am I on the right track?

Jeff: Yes, there’s not one big showstopper feature in 4.0 like there was in 3.0 with materialized views, and we didn’t have to rewrite the storage engine or anything drastic like that. There are a lot more performance, usability, and stability improvements, so things like asynchronous internode messaging with Netty and zero-copy streaming make Cassandra much faster at moving data around when your topology changes, for example when you add or remove nodes. Virtual tables represent a promising way to make monitoring and managing Cassandra easier by accessing settings and metrics via Cassandra Query Language (CQL) instead of Java Management Extensions (JMX). Transient replication is a cool mechanism Apple introduced for storing fewer replicas and saving on hardware costs, and of course there’s a new internal query logger that enables full query logging and audit logging. From a security perspective, there’s network authorization and hot certificate reloading. So there’s a lot of goodness in this release. Plus, there’s a serious step up in testing rigor that promises to make this a “dot zero” release that users can actually put in production!

At the same time, there have been a good number of changes in the DataStax Java Driver v4 API, so I had to give all the application code samples in the book a refresh as well.

Patrick: But there’s also more that’s changed around the database outside of just the binaries, right? Were people even running Cassandra in Docker yet when you wrote the 2nd edition? Now we’re talking in the community about building a common Kubernetes operator and management sidecars.

Jeff: You’re right, the situation has changed significantly. When I started revising the book, I had to laugh when I re-read the Docker section. It was very conservative and gave all kinds of caveats about running your production Cassandra clusters in Docker, and now people do it all the time. I updated that guidance and added a section on Kubernetes operators, the emerging Cassandra Enhancement Proposal process, and the prospect of CEPs for management sidecars and a Kubernetes operator.

Patrick: And Apache Kafka has really come onto the scene in a big way in the past few years.  

Jeff: That was a fun new addition to the integrations chapter, discussing how Kafka and Cassandra are complementary technologies and the different patterns for using them together. For example, using Cassandra as a source or sink for Kafka topics, or using Kafka to choreograph interactions between different Cassandra-backed microservices. The other major new section in that final chapter of the book is based on a great suggestion you gave me: to summarize the process involved in migrating from a legacy relational database to Cassandra. I think it really ties the book together well, so thanks for the inspiration on that one!

Patrick: I understand you’ve got a whole collateral ecosystem brewing around the book this time. Working microservices, online exercises, workshops, and so on.

Jeff: This happened without any master plan on my part. After the 2nd edition of the book was released, O'Reilly asked me to create an online training course around application development with Cassandra. As a result of that I ended up creating the Reservation Service, a reference microservice implemented in Java that uses Cassandra as its data store. Cedrick Lunven helped me modernize this service to be based on Spring Boot. I also added a new chapter to the book using the Reservation Service as an example of how to design microservices using Cassandra.

Now I’ve started taking all of the code samples from the Reservation Service and all of the CQL tutorials and turning them into Katacoda scenarios that you’ll be able to access on the O’Reilly Learning site. 

Patrick: Katacoda is a fantastic online learning environment. I hear we’ll be seeing this on some DataStax websites soon as well. It does take a community to pull something like this together. Is there anyone else you’d like to thank?

Jeff: The first person to thank is my co-author Eben Hewitt, who gave me the opportunity four years ago to update the book for the 2nd edition and has been a great encouragement for the 3rd edition as well. The beginning of the book is still only minimally changed from Eben's fantastic work in Chapter 1 of the first edition. It’s not only the best-articulated proposition for the emergence of NoSQL that I've read, but it’s some of the best technical writing I've encountered, period. The minor changes I've made there were to include the appearance of the "NewSQL" databases like Google Spanner.

I will also highlight the contributions of my technical reviewers, including Alex Ott, Wei Deng, Pankaj Gallar, and Cedrick Lunven, who spent who knows how many off-hours reviewing the text. Between them they contributed over 400 comments and suggestions. They really kicked my butt and made this a better book.

Patrick: Now that everyone is excited for your update, how do people get their hands on this? Do they have to go buy it?

Jeff: I’m really pleased with how O’Reilly and DataStax have partnered to make this content freely available. DataStax has sponsored the book, so for a limited time you can get a free copy from the DataStax website. Based on a great idea from Nate McCall, we were also able to donate the data modeling chapter to the Apache Software Foundation, so you can access that as part of the Apache Cassandra project documentation. Plus, a shortened version of the new chapter on designing microservices with Cassandra has been published on DZone. So no, you don’t have to go buy the whole book; there’s a lot of material available for free.

Patrick: Thanks for spending all the nights and weekends to make this important update. I’m not sure what you’ll do with all the free time you have now that you have finished! The Apache Cassandra community will surely appreciate your labor of love. Can’t wait to see it out in the wild.
