[Webcast] It’s Clouderiffic! Migrate From Open Source Cassandra to Astra on GCP.
Investing in Google Cloud? Do you love the scale-out, active-everywhere Apache Cassandra® NoSQL database? Looking to migrate to a cloud-native, enterprise-grade offering?
In this recorded webcast we discuss:
- How to deploy DSE using the GCP marketplace
- Connecting to DSE using CQL Shell
Interested in taking DataStax Astra for a spin? Sign up for the free 5GB Astra tier and start building!
Introduction
Matt Kennedy (00:02): Good morning and thank you for joining us for today's webinar, Migrate Open Source Cassandra to Astra. My name is Matt Kennedy. I'm a product manager here at DataStax and I'm joined today by Andy Goade, one of our brilliant data architects. Before we get into our topic, I wanted to go through a few housekeeping items. Everyone is muted and will remain so for the duration of the presentation to avoid any background noise. If you have any technical difficulties, please attempt to refresh your browser. In addition, if you have any questions throughout our presentation, we encourage you to submit them via the Q&A widget and we will do our best to get to them during the latter portion of the presentation. You can also find resources related to today's topic on the right side of the screen, which we invite you to explore and download.
Cassandra: The Best NoSQL Database Choice
Matt Kennedy (00:47): Today, we're going to cover topics that are important when you're looking at porting an existing Cassandra app into Astra. Later on, we'll be taking a closer look at how we can use a GCP tool like Dataproc to help us move data around. For now, let's jump right in and look at the first thing that we'll need to do which is convert our connection code. With Astra, we introduced the concept of a secure connection bundle. And what that allows us to do is very, very simply get a connection to an Astra database. Now, this is similar to code that we would write for Cassandra to connect to a Cassandra cluster we run ourselves, and Andy's going to take us through the steps that are important there in how we do that conversion and show an actual example of a sample app that he converted. Andy, it's all yours.
Easy to Connect & Migrate
Andy Goade (01:40): Hey, thanks Matt. I guess first to start with, as Matt said, there's a really easy way to connect to the Astra database. There's several options that you get once you've created your database. The first one is the secure connect bundle and that has all of the information included in it. The address of the Astra database, the ports needed, the certificates and everything to be able to connect to that Astra database. And that is downloaded as an easy zip bundle when you first go into your database. And then there's also REST token authorization that allows you to do either REST API access or GraphQL access. So I was pretty interested once Astra came out about how easy it would be to go and convert an application.
Twissandra Demo
Andy Goade (02:29): There's an application called Twissandra that's essentially a Twitter-like application built on Cassandra as the back end. So I went and forked that code, and it was pretty well written, so that just like a lot of modern applications, the connection code is in a single location where it's very easy to port between databases. So I found the spot in the Twissandra code where the original Python code connected to a local Cassandra instance. And converting it to connect to Astra was very easy. I simply had to add three lines of code, including that secure connection bundle, to be able to access the database. So what I'm going to do now is take you through how I actually ported that application and show you an example of it running on Astra.
Andy Goade (03:36): Hey everybody. Wanted to make a quick video today on how easy it is to port an existing application using a local or server version of Cassandra to using DataStax Astra. So I found this project called Twissandra. It's essentially a Twitter-like application but uses Cassandra as the back end. So I went ahead and forked that and just had to change a couple of files. It was really easy to do. There's this cass.py. All I had to do was add this PlainTextAuthProvider and then change the credentials from using a local cluster to using the secure bundle that you get with Astra, with username and password authentication. And then there was one other spot that I had to change, and that was under tweets, management, commands, and then this sync_cassandra. So again, just had to put in the secure connect bundle and the username and password and change that. And it was really, really easy to do.
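For reference, the change Andy describes looks roughly like this with the DataStax Python driver; the bundle path, keyspace name, and credentials below are placeholders, not the actual Twissandra values:

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# The secure connect bundle downloaded from the Astra console carries the endpoint,
# ports, and certificates, so only its path and the credentials are needed here.
cloud_config = {"secure_connect_bundle": "/path/to/secure-connect-database.zip"}
auth_provider = PlainTextAuthProvider(username="astra_user", password="astra_password")

cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect("twissandra")  # assumed keyspace name
```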
Andy Goade (04:46): Of course, I have to have an Astra account for this, so I'm just using my free tier of Astra. I have a separate keyspace under my KillrVideo database. I'm going to go ahead and show you that. Just to show you the keyspace here. And it's of course created all the tables and everything already, which the Twissandra app will do for you. So we have all that. And we can see this is just the users that it created and then some of the tweets that it created.
Andy Goade (05:23): Okay. So here's the app here. It's pretty neat. You can see there's a public timeline here. You can click on a user and it shows you all of their tweets. You can log in and you can post a tweet. We'll just call this one "third test tweet, this is the tweet," and you see it'll show up on my timeline. So pretty cool app to mess around with. The one thing I will point out is if I go back to public you won't see my tweet there, and that's because when the application generated data, it generated data in the future, so these tweets are probably in the year 5,000 or something. Definitely something you can go in and tweak in the code, but just be aware of that. Anyways, hope you enjoyed the video and learned something.
Matt Kennedy (06:19): Awesome. Thank you Andy. The one thing I do want to highlight about that secure connect bundle before we move on to other topics ... The connection code we typically write for self-managed Cassandra assumes it's running on a trusted network. And what you don't see in those cases is all of the SSL context that has to get set up to establish a secure connection to a database. So we in Astra do everything secure by default. You establish a very strong connection to the Astra endpoint that uses a two-way SSL handshake. And all of that configuration is taken care of in the bundle, and there's no additional SSL complexity that you have to add to your code. So another kind of nice thing about that.
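For contrast, this is a rough sketch of the manual two-way TLS setup a certificate-secured, self-managed cluster would typically need with the Python driver; the paths, addresses, and credentials are illustrative, and the exact settings depend on how the cluster's certificates were issued:

```python
import ssl

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Trust the cluster's CA and present a client certificate for two-way TLS.
ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ssl_context.load_verify_locations("/path/to/ca.crt")
ssl_context.load_cert_chain(certfile="/path/to/client.crt", keyfile="/path/to/client.key")
ssl_context.check_hostname = False  # often relaxed when connecting by IP; depends on the certs

cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],
    ssl_context=ssl_context,
    auth_provider=PlainTextAuthProvider("user", "password"),
)
session = cluster.connect()
```

With Astra, all of this is folded into the secure connect bundle.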
Guardrails
Matt Kennedy (07:17): With that said, now let's go into guardrails. Really these are codified best practices that help people build better Astra applications. These are features that make everything more stable, safer to use; they keep you from doing bad things as a developer that you don't intend to do. But sometimes, without knowing all of the rules of a system, we can do things that are a little bit risky, and this'll help us avoid those situations. So to cover what we're going to look at in these next few slides, I'll highlight the specific guardrails in Astra that limit things that are unlimited in open source Cassandra. So if you were running Cassandra you would not see these limits. My personal opinion is that Cassandra probably should have prescribed limits for these things a long, long time ago. You can put a two gigabyte blob into a Cassandra column cell and it'll work, but why are we doing something like that? That's going to cause a lot of risk if we ever try to retrieve that value.
Adapting to Astra Database Limits
Matt Kennedy (08:43): So what we have done in Astra is we've said for the majority of our use cases there are reasonable limits that we put in place for the best behavior of the database. So we'll go through each of those here but this is the summary view. So starting with how we adapt to 200 tables per keyspace for example. This one is interesting because what we're saying here is when you run in Astra you're going to have an upper limit to the number of tables. And in fact we warn after there's a total of 100 tables in the database. And I would encourage you to stay below that limit. But really that's not as problematic a thing when you can rely on a service like Astra versus having to run databases yourself. So one of the things that we can do if we have more than 200 tables in our Cassandra cluster and we're looking to port that to Astra, is we can say right, what are my access patterns to these tables? Is there anything I can do to take advantage of the fact that these tables in many ways are independent sets of data? This is not a relational database. I'm not going to be doing joins across these tables.
Matt Kennedy (10:05): So I've probably already got relatively independent sets of data in the tables. I don't have to have them in the same database. So for example, for my largest, busiest tables that need really, really low latencies consistently, I can isolate those in their own cluster or in their own Astra database with more resources allocated to that database. And then I can have potentially a much smaller, lower cost database in Astra that is there for all of those supporting tables. And the change that I have to make in my app at that point is I need to manage different connections for different collections of tables. An easy way to keep that organized is to use prefixes on the tables that kind of match the database names just so you can keep things straight as you're coding against two connections rather than one connection. Or there's always the microservice route where an individual microservice is only going to be connecting to a single database at a time and any coordination that happens is done in the client layer where you're going to be accessing independent microservices.
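As a sketch of what managing those two connections might look like (database names, bundle paths, keyspaces, and credentials are illustrative):

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

def astra_session(bundle_path, keyspace, username, password):
    """Open a session against one Astra database via its secure connect bundle."""
    cluster = Cluster(
        cloud={"secure_connect_bundle": bundle_path},
        auth_provider=PlainTextAuthProvider(username, password),
    )
    return cluster.connect(keyspace)

# Large, latency-sensitive tables get their own database with more resources...
hot = astra_session("secure-connect-hot.zip", "hot_ks", "user", "password")
# ...while the many supporting tables share a smaller, lower-cost database.
support = astra_session("secure-connect-support.zip", "support_ks", "user", "password")
```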
Matt Kennedy (11:19): So if you are looking at porting a Cassandra application to Astra and you're apprehensive because you've got more than 200 tables to deal with, look at this as an opportunity rather than something you have to find a workaround for. There are almost certainly ways to improve your overall application performance by splitting things up into multiple databases and it may not be as much of a logistic headache as you think it is.
Matt Kennedy (11:49): Moving on to looking at column sizes. So here we're talking about the size of the data that can fit in a single column value cell. So in other words, if I have five column values in my table and each can be five megabytes, then that row can be 25 megabytes of data. So this isn't even that much of a terrible limitation. With collections and blobs we do reduce that a little bit to one megabyte. But think of a collection as something where you intend to have many, many entries that will eventually take up quite a bit of space if they're all large. So that's why we have these limitations here. What I would encourage you to do is test for this in your unit tests: make sure that you are testing inserts or updates that are beyond these limits so that you know what your code does when something is inserted that exceeds them. And if anything, most of what you're adding to the database should be much smaller than this.
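One hedged, pytest-style sketch of that kind of test follows; the table, the 5 MB threshold, and the `session` fixture are assumptions, and the exact exception surfaced will depend on your schema and which guardrail trips:

```python
import pytest

def test_oversized_value_is_rejected(session):
    # 6 MB payload, just above the assumed 5 MB single-column limit.
    too_big = b"\x00" * (6 * 1024 * 1024)
    with pytest.raises(Exception):  # the driver surfaces the server-side rejection
        session.execute(
            "INSERT INTO media_meta (id, payload) VALUES (%s, %s)",
            (1, too_big),
        )
```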
Matt Kennedy (13:02): So really, if a data element is that large, ask yourself if it belongs in a database or whether it should be in object storage, with the URL for that element stored inside the database instead. If it's an immutable object that is that large, object storage makes a lot of sense. The database, which supports random updates to individual rows, is better suited to data that isn't a large immutable object. A common use case we see here is media management. There are a lot of use cases for using Cassandra as the metadata system for something that streams audio or video. And in many cases those full fidelity files that you're going to be dealing with are very large and they are going to sit on object storage that is really geared towards serving those out at high performance. But for a lot of that metadata it makes sense to keep that in Cassandra. It also makes sense to look at things like thumbnails for files as the elements that would go into Cassandra rather than the full fidelity files. So consider that as an option for addressing these limitations. Put the really large immutable stuff in another system, store the URLs, store the metadata, and that is likely to result in an overall higher performance system as you port from Cassandra to Astra.
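A sketch of that pattern, with illustrative keyspace, table, and column names:

```python
# Keep metadata, a pointer into object storage, and at most a small thumbnail in the
# database; the full-fidelity file itself lives in object storage.
session.execute("""
    CREATE TABLE IF NOT EXISTS media.video_metadata (
        video_id uuid PRIMARY KEY,
        title text,
        duration_seconds int,
        storage_url text,   -- URL of the full-fidelity object in object storage
        thumbnail blob      -- small preview only, well under the column size limits
    )
""")
```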
Matt Kennedy (14:44): The other one we want to look at is the limit on the columns per table and the limit on materialized views per table. So this is an interesting one. We do want to hear feedback on these specific limits. Part of the reason that there are limits here is simply that it is a reasonable thing to limit in a database: it helps users keep partitions relatively small compared to completely unconstrained column counts. This tends to be an issue when we see people porting apps not necessarily from Cassandra, but from Dynamo. And the advice that you hear in the Dynamo world is to try to put everything in one table. That's not usually the case with Cassandra applications or anything that we are going to port to Astra. But if that is the case, breaking down those single-table data structures into multiple-table data structures often gets you as far as you need to go with this.
Matt Kennedy (15:54): That said, we're very much interested in hearing from people that need something beyond these limits. We have raised them before. I think it's likely that we will raise them again. So please, if you are someone that tends to make use of really, really wide column sets in a row in Astra, please reach out to us. The materialized view one is interesting because this is heavily impacted by our introduction of SAI a couple of weeks ago. SAI, to remind you, is our new storage attached indexing engine, which brings completely new, high performance indexing to Cassandra. Before that, materialized views were a great way to make additional read tables off a master table. So if we were inserting all of our data into one master table and we wanted to have different ways to query that data, what we would do is create a materialized view. And with no additional code, the materialized view gives us additional tables we can query that have different key structures. And it's really those key structures that we're querying on in the absence of SAI. So now that we have SAI, the materialized view has a much more specialized purpose, which I will talk about in the next slide here.
Matt Kennedy (17:23): A materialized view becomes something that we can continue to use for really, really low latency read situations. And that's all we need it for. We don't need it to support all of our queries. So what we can do instead is consider refactoring some materialized views into indexes on that master table. Now, the trade off we're going to make is that our read latencies are going to go up a reasonable amount on queries that go through indexes as opposed to materialized views, but we get better consistency semantics over the data that we are querying from a master table via an index than we do going through a materialized view. So when we talk about these limits for SAI now, because SAI is going to be solving some of these problems for us, we have to talk about the old school C2I, or Cassandra 2i, which is the indexing mechanism that came out with Cassandra in, I believe, 0.7, originally written for Thrift. It certainly helped solve a few problems, but over the years it developed a reputation as a tool that caused more problems than it solved. And so I would really encourage you to make this zero Cassandra secondary indexes per table even though the system will allow you to do one.
Matt Kennedy (19:00): You get 10 SAI indexes on those tables. If you haven't seen the last webinar that we did a couple of weeks ago, we went really in depth on some SAI examples. I do encourage you to go back and check those out. The important thing to remember here is data modeling no longer starts with knowing your queries for Cassandra. You don't have to de-normalize everything in order to support query patterns that you don't know about when you begin your process. And so that means that when you do have to add a query later, it is a much easier problem to deal with. In many cases all you have to do is add the index, whereas prior to SAI, really you'd have to take a look at whether or not that query could be supported by any of the tables you have so far. If not, how far off is it? Can I make a materialized view or do I have to completely add a new table that I'm going to manage manually and keep in sync with my other tables? So SAI is a real game changer here. We provide 10 indexes per table and you can combine.
Matt Kennedy (20:18): So each index is for a column, but you can combine queries for multiple columns into one query. So if I have three columns I want to query on and I have an SAI index on each, then I could write one query that hits each of those columns and none of them have to be in the primary key. So major, major changes for Cassandra and Astra with SAI. On a performance basis, you are going to see much better performance on the write side than we would have seen with C2I, or, if you were a DataStax Enterprise user, even than we see with the DSE Search integration. So we have basically a 43% throughput improvement on write and a 230% latency improvement on write with SAI versus Cassandra secondary indexes. The side effect of that is that there is less in flight on the coordinator nodes for every write that you do. And what that means is we are churning through operations faster. We're able to deal with more standard writes in that same time because of that. So it's a massive efficiency improvement for the Cassandra landscape and we're very excited for people to start using it. If you have any questions, please reach out to us on the Astra app. You can always get in touch and set up a one on one session to go through data modeling problems that you may be thinking through.
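To make that concrete, here is a hedged sketch of indexing three non-key columns and combining them in one query; the keyspace, table, columns, and index names are illustrative:

```python
# Create one SAI index per column we want to filter on.
for column in ("country", "state", "city"):
    session.execute(f"""
        CREATE CUSTOM INDEX IF NOT EXISTS readings_{column}_sai
        ON demo.readings ({column})
        USING 'StorageAttachedIndex'
    """)

# A single query can now combine predicates on all three indexed columns,
# none of which need to be part of the primary key.
rows = session.execute(
    "SELECT * FROM demo.readings WHERE country = %s AND state = %s AND city = %s",
    ("US", "Illinois", "Chicago"),
)
```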
What is Cloud Dataproc?
Matt Kennedy (21:56): So with that said, I want to turn our attention now to Dataproc. Dataproc is a way to run Spark in GCP. For those of you that aren't familiar with DataStax Enterprise, we have had an integrated Spark system for a long time with DSE, which makes a lot of sense for our self managed users. When people are consuming Astra as a database as a service, what they really want to do with Astra is use the native Spark services that exist in different cloud providers or the third party ISV Spark processors like Databricks for example. And Dataproc is the way we process Spark in GCP. So once again, I'm going to hand it over to Andy to take us through the Dataproc demo and we'll talk about some of the use cases and how we use that a little later on. Take it away Andy.
Andy Goade (22:55): Yeah. Thanks Matt. I love everything about those SAIs because they really help with some of the use cases that you may have for your Spark or Hadoop applications with data science, kind of looking through different columns to see different bits of information that might be interesting to you. So those SAIs really help in indexing that data and getting at it from a different angle. So just a little bit more about Dataproc. It's really easy to configure for your needs. Really fast to deploy. What I really like about Google is they give you all these multiple ways of doing it. So you can write a script to go and deploy it if you're doing it multiple times. They also give you a REST API that allows you to deploy.
Andy Goade (23:46): And also, if you're new to it, you can just go into the console and click through and pick the different options that you need. You could get a cluster up and running within two minutes, which is really nice. So it gives you all the standard Hadoop and Spark tools that you need, but it also gives you optional additional components like Jupyter or Zeppelin style notebooks. And it's also highly customizable to what you need for whatever your use case is. So you can add your applications in there or you can add different add-ons and so on.
Dataproc Integration
Andy Goade (24:20): So for this demo what I'm going to show you is using Google Dataproc to move data from one table in Astra to another table in Astra. This might be helpful if you're trying to pre-populate a dev or test or QA environment, or if you're changing the keyspace or the way that your data is partitioned on disk, you may need to move data for that. So the demo, I based it off of Astra as of a week ago. We're constantly adding new features, so it's always good to have that sanity check of, hey, this demo was produced a while ago, did something change or is there a new option or something. Same thing for Dataproc. Again, it was based off of last Wednesday and the 1.4 Ubuntu image was used for it. I also have a git repo out there that has the full setup instructions on how to walk through setting up Dataproc to work with Astra. And it also packages some data science notebooks that we've run in previous webinars that you may want to play around with on that Dataproc cluster.
Andy Goade (25:39): So we'll start the video now to show you the demo.
Dataproc Demo
Andy Goade (25:48): In this demo we're going to walk through using Google Cloud Platform's Dataproc solution to migrate data from one Astra database to another. This might be helpful in a scenario where you want to copy maybe production data into a development system, or you may be migrating from one database to another for any number of reasons. For those of you that don't know, Dataproc is Google's Hadoop or Spark as a service. As you can see here, this is the console. I don't have a cluster created, but I do have a script here that I'm going to use to create a cluster. We'll walk through this pretty quickly. I'm just going to point out some of the specific things that need to be done for creating the cluster so that it can interact with Astra.
Andy Goade (26:39): One of the first things I want to do here is I want to enable this component gateway. This isn't specifically for Astra but it's so that we can use Jupyter notebooks and get to their web interfaces through the Dataproc cluster. And I've also specified this bucket that I'm going to use connected to my Dataproc cluster and that has some information already in it. Let me show you that bucket right here. One of the things that I have already set up is this notebooks Jupyter folder and that's basically just a location for all of my Jupyter notebooks that will automatically sync with the Dataproc cluster's Jupyter install. I've also created a folder here called AstraCreds. And what that does is it has the secure connection packages to Astra for the different databases that I'm going to connect to. It also has this connection package shell script. And I'll show you here what that shell script has in it. Basically, it's going to copy those credentials onto each of the nodes in the temp directory of the Dataproc cluster. So just allows that connectivity to be local on each of those nodes in the cluster and creates that connection package there.
Andy Goade (28:02): The next several lines are the region that you need to be in, subnets, machine types, number of workers, et cetera. I want to point out the properties here. There are two properties that need to be set. One is a security property that needs to be set to false. This is just because we've seen some irregularities with the ability to move data in and out of the cluster. And then also the Spark jars: we want to make sure we add the Spark Cassandra connector into that cluster as well. As I mentioned before, we're going to use Jupyter to move the data using Spark from one database to another, and as a prerequisite of that, Anaconda also needs to be installed. There's a couple of initialization actions that I'm going to run. The first one is that Astra connect package shell script that I showed you before. The second is a ... Actually it's a Google initialization script that is region specific. But it's essentially to run a pip install for Python packages.
Andy Goade (29:14): And here are the additional packages and the metadata that I'm going to install. For this demonstration we don't necessarily need the Cassandra driver, but I'm going to install it anyways. But we will use Pandas and Google Cloud Storage. So I'm going to go ahead and copy this. And in this console I'm going to go ahead and run this create. So what'll happen is, once that's created, it will show up in this console as creating.
Andy Goade (29:45): But what I'm going to do now is go to Astra and show you the two databases that we're going to be working with. I have this database called Google with a keyspace of prod and a second database called GoogleDev with a keyspace of dev. I don't have any tables or anything created in those, so what I'm going to do is do that now. Click on the Google database. For those connection packages, you can see next to this authentication section, this is where you download those connection bundles. But what I'm going to do is go ahead and launch Developer Studio. And I already have a notebook set up, but usually what'll happen is it'll prompt you for your credentials to get into the studio. I've already done that, so I'm going to go into this notebook here. And just to show you how it was set up, or what's currently in the database: we really just have the keyspace with the replication definition. So I'm going to go ahead and create this city_temps table and just run a query here to show that there's no data in that table. So what I'm going to do is I have another script here that I'm going to run and this is a DSBulk script.
Andy Goade (31:12): Essentially it's a tool that DataStax has created to bulk load data into different tables and databases. It can be downloaded at downloads.datastax.com. You can see there's the DataStax Bulk Loader there. But essentially what it does is it says I want to load data from this CSV file into this keyspace and table, and this -b is the secure connect package for that production database, and the username and password. So I'm going to go ahead here and copy this. And in this terminal window I'm going to go ahead and run that command. While that's happening I'm going to go back and set up the dev database. So we'll go and do the same thing here. Again, secure connection bundle there. Going to launch the Developer Studio. Already have a notebook here. Just to show that there's nothing in this, I'm just going to describe that keyspace. There's nothing there, so I'm going to go ahead and create the table for this keyspace and just run a query here to show no data's in there. All right. So going back to the DSBulk we can see that records have been loaded fairly quickly. 2.8 million rows in 30 seconds.
Andy Goade (32:56): There are some duplicates in my dataset, so you're going to see about 2.8 million or so records when I do my count in Spark; it's not too far off of that 2.8 million records. So if I go back to my production keyspace now, I can run a distinct and see that I have some records in there now, and those are the seven distinct regions that are in my dataset. Okay, so we can see that the Dataproc cluster has been provisioned. We'll go take a look here at the console. We can see it's up and running. So I'm going to go ahead and click into that and go into my web interfaces. And launch my Jupyter notebook. I already have a Jupyter notebook created called ... I called it Prime Dev Dataset. That's to prime the development dataset. And I have several steps here that I'm going to be running through to take the data from the production Astra database to the development Astra database. So the first thing I need to do is import some Python modules. I'm going to go ahead and create some variable names and populate those to be used in some of my other commands.
Andy Goade (34:39): Now I'm going to create the Spark session with those connection properties. And you can see here that I have the secure connect bundle name. Remember those were copied into the temp directory, so I have /tmp and the name of the secure connect bundle. And then my username and password, and the table that I'm going to be reading from in the keyspace. So let's go ahead and run this. What that's going to do is populate this temps DataFrame, reading from the Astra table in the production database. And once it's done I'll output this table row count. It's going to be about 2.8 million records. That should be done any second here. There we go.
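A rough PySpark sketch of that read step, assuming the Spark Cassandra Connector is on the cluster and the secure connect bundle was copied to /tmp by the initialization action; the file name, credentials, and keyspace are illustrative:

```python
from pyspark.sql import SparkSession

# Session configured against the production Astra database via its secure connect bundle.
spark = (
    SparkSession.builder
    .appName("prime-dev-dataset")
    .config("spark.cassandra.connection.config.cloud.path",
            "file:///tmp/secure-connect-google.zip")
    .config("spark.cassandra.auth.username", "prod_user")
    .config("spark.cassandra.auth.password", "prod_password")
    .getOrCreate()
)

temps = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(table="city_temps", keyspace="prod")
    .load()
)
print(temps.count())  # roughly 2.8 million rows in the demo dataset
```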
Andy Goade (35:37): So now what I want to do is I want to take that DataFrame and I want to write it to my dev table. So I have these load options. Same table name that I created. The dev keyspace, which is dev. The connect package that I'm going to use for that database, the username and password. And just to verify, I'm going to go back over to my dev studio notebook and I'm going to go ahead and run my select distinct again and I can see that there's no data. So let's go back to this notebook and go ahead and run this Spark load. Now that that's complete, I'm going to go ahead and create a second Spark session and DataFrame just to get the count of the records that are in that dev table. And I can also go back to my studio session and run the same query again to verify that there's data in there and there we go. There's the seven regions that I was looking at. So this Spark session has also come back with 2.8 million. And just to verify that they're equal I'm just going to run those Spark sessions again on both the prod and the dev to validate that those are there.
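And a matching sketch of the write step, passing the dev database's bundle and credentials as options on the write itself, mirroring the load options in the notebook (file name and credentials are again illustrative):

```python
# Append the DataFrame read from prod into the dev database's table, overriding the
# connection options for this operation with the dev secure connect bundle.
(
    temps.write.format("org.apache.spark.sql.cassandra")
    .options(
        table="city_temps",
        keyspace="dev",
        **{
            "spark.cassandra.connection.config.cloud.path":
                "file:///tmp/secure-connect-googledev.zip",
            "spark.cassandra.auth.username": "dev_user",
            "spark.cassandra.auth.password": "dev_password",
        },
    )
    .mode("append")
    .save()
)
```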
Andy Goade (37:06): And we can see the 2.8 million records there. And dev should be coming back any second now. There we go. We can see that those are equal. Just another feature I wanted to point out before the end of the demo: we can see that we have our partitioning key and our primary key here. One of the nice new features that DataStax has just released and is contributing to the open source environment is storage attached indexes, and that's for the case where you may want to select records with a qualifier here that is not necessarily in the exact order of this primary key. So if I wanted to run this select statement on city_temps where the state is Illinois, I'm going to get an error because I haven't put region, country into this qualifier to get this result set. But what I'm going to do here is go ahead and create an index on state. So instead of having to have all that other criteria, I can just use simply state to run my query. So let's go ahead and create that storage attached index. So it's going to run and it's going to take a second or two here to actually go and index that field.
Andy Goade (38:40): But once it's done, I should be able to go here and run this query again, and it's going to complete successfully because of that index and give me the records that I want to see. So again, this may be helpful for creating indexes on columns that you may have previously had to have a Solr core on or something. But it gives you another option to be able to get the data that you may want without fully qualifying it.
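The index step in the demo looks roughly like this, under an assumed city_temps schema where (region, country) leads the primary key, so filtering on state alone is rejected until an index exists:

```python
# Create a storage-attached index on the state column.
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS city_temps_state_sai
    ON prod.city_temps (state)
    USING 'StorageAttachedIndex'
""")

# Once the index has finished building, the same filter works without the partition key.
rows = session.execute(
    "SELECT * FROM prod.city_temps WHERE state = %s",
    ("Illinois",),
)
```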
Andy Goade (39:15): Anyways, that's the end of the demo and hope you enjoyed it. Thank you.
Andy Goade (39:26): All right. So just to summarize, what I did there was four key things to remember. Change the properties. There's one property that took me a little bit to figure out, and that is this Conscrypt provider. It's a Java security property. But we've seen some inconsistencies with it, so it's better to just turn that off so it's not used. The next property is, again, the Cassandra Spark connector so you can use Astra with Dataproc. Initialization actions: copying that secure connection bundle to each of the nodes is going to save you a lot of headache. And then having a Google storage bucket set up ahead of time if you're going to be using Jupyter notebooks, with a script to move that. And then the metadata stuff: adding any of the pip packages, the Python packages that you need. And then of course you can add on to that list for whatever you need for your project or application.
Andy Goade (40:31): Other useful commands. Enable the component gateways. That allows you to get to the web interfaces of those additional components that you're adding. So if you wanted to look at the Spark console, again, the Jupyter notebook, make sure you have that checked. And then any of the additional other components that are there, go ahead and take a look and see if you need those as well.
Matt Kennedy (40:56): Awesome. That is a really cool demo Andy. And I kind of want to ask, you're relatively new to DataStax and this was one of the first projects you did. How long did it take you to work out from starting at point zero how to get everything wired up with Dataproc and Astra?
Andy Goade (41:16): Yeah. Great question. It was really pretty easy. A lot of the information that I found was out there on the web. There's obviously a lot of information that DataStax puts out there as far as how to do things. So I kind of just had to stitch a couple of different things together. Ultimately, it probably took me a good day and a half, two days of solid work to go and figure things out and a lot of it was just nitpicky stuff like I put quotes around some strings that didn't need quotes. And just figuring that type of stuff out. But the Astra stuff was super simple. Easy to spin up the databases and everything. Google as well. Once I had those quotes and everything figured out, it was pretty easy.
Matt Kennedy (42:07): That is slick. And I think it is worth pointing out that you have this covered under your properties bullet: the connector that you use is the open source Spark Cassandra connector. So while in your demo it was two Astra databases, one of those could be an open source Cassandra database and the other an Astra database. So if you're running a Cassandra database in GCP, this is one way to move data from your open source Cassandra database to an Astra database if you're preparing for a migration, say for example. Or you even want to, as you kind of used the use case example in your demo, pull out a sample dataset from a production database to use as a development resource or a testing resource in a new Astra database that you're planning to port to. Are there any other interesting use cases for this kind of Spark tooling that you think are worth mentioning?
Andy Goade (43:07): Yeah, absolutely. I think one of the biggest use cases for this is if you have a large set of data scientists or something in your company, or people that need their own separate resources, it's really beneficial for them to be able to spin up their own Dataproc and Astra clusters to be able to go and test code out or fiddle with whatever they want to do. Because then they're not stepping on each other's toes and it makes it a lot more efficient for them to get completed what they need to complete.
Matt Kennedy (43:39): Cool. I mean, kind of what you laid out there is if I am part of a team of developers that is trying to learn Astra, come up to speed, but we've got a Cassandra database, I could potentially use something like this even to copy a gigabyte or five into an Astra free tier database and everybody can have their own playground with a nice sample set of data to work with.
Andy Goade (44:08): Yeah, absolutely.
Matt Kennedy (44:08): That's pretty cool.
Andy Goade (44:10): Yeah. And you kind of make sure you have a golden copy somewhere so that everybody's always consistently working off of that same one, so you know you have the base point, which is so important when you're doing a lot of the data science work. You need to have the consistent base point to make sure that the iterations you're going to do are what's giving you the different results, and you know what you started from.
Matt Kennedy (44:36): So, non sequitur question. I ask this because it sounds like you're coming a little bit from the data science background. I have a geospatial development background. And I think both of those groups make heavy use of NumPy and SciPy. But I'm wondering what your preferred pronunciation for those packages is because I like to call them numpy and scipy because I feel like they're cuter that way. Do you have a preferred pronunciation for NumPy and SciPy?
Andy Goade (45:09): Yeah, I'm the traditional NumPy, SciPy.
Matt Kennedy (45:13): All right.
Andy Goade (45:13): Yeah. So great question.
Matt Kennedy (45:16): Fair enough. Tabs or spaces?
Andy Goade (45:19): Spaces.
What’s Next?
Matt Kennedy (45:20): Oh, really. Oh, man. I'm tabs all the way. I'm not sure we can work together after this, so let's get through it. So what is next? Obviously we want people to have the tooling and the comfort level that they need to consider porting projects from open source Cassandra into Astra. We think it'll make your life a lot easier being able to completely give up the challenges of administering a Cassandra cluster and leave that to us so you can focus on the fun stuff like building an app. I do want to point out to everybody that we are soliciting signups for a free migration. So if you have a real app and you want to bring that into Astra, please click that top link on the what's next slide and get signed up for that. We'll be in touch as soon as we can. Essentially what's going on here, lest this sound like some kind of a there's-no-such-thing-as-a-free-lunch quandary, is that we are building out tooling to make this a seamless, one-click process for anyone to do a live migration from an existing production database to an Astra database without taking any downtime.
Matt Kennedy (46:49): And in building out that tooling, we need serious test cases and that's where you guys come in. So if you have a serious need to move an existing Cassandra application into Astra, please check out that link and sign up and we will be in touch very shortly.
Q&A
Matt Kennedy (47:07): So with that, I want to turn it over to our questions here. And we have one question that came in about the guardrails. The question is, "Is the Astra limit 200 tables per keyspace or is it 200 tables per database? I was under the impression the Astra limitation was 200 tables per database." It's a really good question. The source of the confusion here is that when Astra launched it only supported one keyspace. So you would create a keyspace with your database and that was the keyspace. Now that we have taken the steps to launch things like multi region where we need to give users control over where their data is replicated to in case they want to keep data within a particular keyspace, we've added support for multiple keyspaces and hence the clarification in what the guardrail really is. It is 200 tables per keyspace. But we do warn much, much earlier than that and I do encourage people to stay under the warn limits.
Matt Kennedy (48:17): Looking at the next question, what if I'm still on Thrift? What if I am looking at porting a really old Cassandra database over to Astra? Astra does require CQL. It does not support Thrift. But is there an easy way I can look at a Thrift migration? I think the good news here is that a lot of the Thrift data models that I have seen at least, tend to be simple in nature. People don't get very complex with Thrift data models. And an easy way to emulate that in CQL for example, is to have a primary key or perhaps a compound primary key with a partition key, and then your columns. You could pull out some fixed columns if you know they're there and let them be conventional columns. Or you can use a collection type like a map. So really Thrift was a map of maps in the whole database. And you can achieve the same thing in a data model on CQL by using that primary key as the key for your map and then your column values are maps themselves. So that's one way to deal with the fact that Thrift has aged out of relevance and how to bring that database into a more modern state.
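One hedged sketch of that map-of-maps style model, with illustrative keyspace, table, and key formats:

```python
# A Thrift-style row with dynamic column names, modeled as a map collection in CQL.
session.execute("""
    CREATE TABLE IF NOT EXISTS legacy.wide_rows (
        row_key text PRIMARY KEY,
        cols map<text, text>   -- dynamic "column name" -> value
    )
""")

# Upserting a single dynamic column is a map-entry update keyed by the row key.
session.execute(
    "UPDATE legacy.wide_rows SET cols[%s] = %s WHERE row_key = %s",
    ("last_login", "2020-10-01", "user:42"),
)
```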
Matt Kennedy (49:58): Another question we have is, "If my cluster is N nodes, how do I know how much Astra I'll need?" And there are a couple of answers to this question. It's a really good one. So I've got an existing Cassandra database. I know how big it is. I know how much data's there. But how do I figure out the equivalent for Astra? And the answer there is you have a couple of approaches. One is to look at the data footprint. If you have two terabytes of data in your database and that's replicated three times, so you've got approximately six terabytes of data in the whole database, what you'll want to do is take that unreplicated number, so the two terabytes which represents one replica set of your data, and divide that by your capacity unit size. So we have two tiers of capacity unit size in Astra. One of those is the C tier, which is 500 gigabytes. The other tier is the D tier, which is 1,500 gigabytes. So the D tier is geared towards higher density, and the C tier towards higher performance. You would essentially take your total footprint of that unreplicated data, and divide that by the size of the capacity unit.
Matt Kennedy (51:23): So let's say we're looking at a higher performance use case. I want to get a C tier database there so I have a higher ratio of compute and memory to data. So I will divide that two terabytes by 500 gigs and I get four capacity units. So I'll need four capacity units of the C series, and then the question is where in the C series do I need to land from a performance perspective? And really the best thing to do there is to run a test workload. We can help you with that if that's something you need. So we use NoSQLBench to take a representative sample of what your data model looks like and we run a simulation against that. And using that, we can really hone in on where you need to be with Astra. But it's also an exercise that's relatively easy to do yourselves. You can essentially figure out how many capacity units you need and look at your workload. If you feel like it is a low number of operations per second, choose C10. If it's really, really high, choose C40. If it's somewhere in the middle, choose C20. And there's always going to be variation based on your data model.
Matt Kennedy (52:45): So what I would do is pick that starting point with the requisite number of capacity units you need, run a test, and see where you are with latency. It may be the case that simply adding a capacity unit redistributes data enough and changes that ratio of memory and CPU to data so that it allows you to come back into a performance profile that works for you. So all of that said, you can take the quick route, you can take the long route. If you need help with either, please reach out to us. But the relatively straightforward pattern there is: figure out how much data you've got and then map that into capacity units in Astra.
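That arithmetic is simple enough to script; a toy helper mirroring it (capacity-unit sizes as stated above, C = 500 GB and D = 1,500 GB of unreplicated data per unit):

```python
import math

def capacity_units(unreplicated_gb: float, tier: str = "C") -> int:
    """Divide the unreplicated data footprint by the capacity-unit size for the tier."""
    unit_size_gb = {"C": 500, "D": 1500}[tier]
    return math.ceil(unreplicated_gb / unit_size_gb)

print(capacity_units(2000, "C"))  # 2 TB unreplicated -> 4 C-series capacity units
```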
Matt Kennedy (53:35): Let's see. We have one more question which is how do I plan for a live migration? That's a great question. It is a bit detailed to really answer it thoroughly. But to give you some idea, one of the things that you want to do if you're going to do this yourself is first, port the app without the data. Make sure all of that works. Make sure you have good test cases. And then what you'll want to do is move over some sample data and ensure that everything still works with your sample data. Finally, when you're ready to really migrate... And again, I encourage you to reach out [inaudible 00:54:18] doing this at no cost for people right now that have the right kind of use cases that we're looking for. But what you would do is you would set up an application system that does dual writes to both databases. And there are some limitations to when you can use dual writes. You can't use it necessarily everywhere. It gets a little bit tricky if we have lightweight transactions in the mix. But for the most part, you can do dual writes to two systems while you've also moved some data over in the background.
Matt Kennedy (54:56): And when those systems are synced up, then you effectively start sending reads from your primary system to the other system. And eventually, once you're sure that the secondary does everything that it needs to, you can shut off that initial database and consider yourself live migrated to Astra. But as I said, there's a couple of tricky steps in there so please reach out.
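The dual-write idea reduces to something like this minimal sketch (sessions for the existing cluster and for Astra are assumed); a real migration also needs retries, failure handling, and care around lightweight transactions, as noted above:

```python
def dual_write(old_session, astra_session, statement, params):
    """Mirror each write to both systems while the migration is in flight."""
    old_session.execute(statement, params)    # existing cluster stays the source of truth
    astra_session.execute(statement, params)  # mirror the write into Astra
```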
Matt Kennedy (55:20): With that, we have had one more question come in. "If Astra is running in AWS, would the Dataproc setup be similar to what we have seen in the demo?" And the answer there is yes. You're going to have a different secure connect bundle, but all the details that are required would be in there. So the Dataproc cluster running in GCP could very well reach out to an Astra database running in AWS. Or really any cloud for that matter.
Matt Kennedy (56:00): So that is it for questions. Before I wrap up, Andy, anything else that you wanted to add to any of those answers or anything?
Conclusion
Andy Goade (56:12): Yeah, I think the advice that I would give is all of this stuff is really easy to go and figure out and start just playing with. And with Astra giving a free tier and being able to connect really from anywhere, it's really easy to go and pick a small project and just start working with the database.
Matt Kennedy (56:36): Awesome. Yeah, absolutely. And please do reach out when you do that. If you get stuck, if you just need to share an idea and ask if something will work, we are available directly within the Astra app via the chat tool or you can schedule time on our calendars to discuss your ideas. So with that, we are out of time for questions and we're actually out of questions. So thank you for joining us all today and we would like to invite you to future DataStax events at datastax.com/webinars. And sign up at datastax.com for more information on Astra. Lastly, be sure to check out the last webinar in this series, Spring Into Action Using Astra on Google Cloud. Thank you very much everyone.