Company · January 29, 2021

[Webcast] What does the future of data on Kubernetes look like?

Scott Regan, Growth Marketing

DataStax Chief Strategy Officer Sam Ramji and leading technologists discuss who manages data in Kubernetes, their biggest challenges, tools that are available, and what the next five years look like in a recorded virtual pancake breakfast hosted by Alex Williams, Founder and Publisher at The New Stack.

In recent years, developers have prioritized cloud-native development, building microservices-architected applications. To prevent data sprawl, more and more of those microservices are ending up inside of containers, with developers using Kubernetes to orchestrate them.

These trends have massive implications for data management — to the point that Sam believes that the 2020s will be the decade of data. According to Sam, the 2000s focused on scale-out networking, and the 2010s were the era of simplifying and standardizing compute.

“We’re at the precipice of the 2020s, and this feels like the decade of data,” Sam said. “Now that we're building on top of those large billion-elements scaling systems, how do we make data fluid? How do we make it containerized? How do we make it Kubernetes-native and cloud-native? Those are the kinds of things at the edge of practice that I see right now.” 

The promise of Kubernetes

The conversation also featured Tom Offermann, lead software engineer at New Relic, and Mya Pitzeruse, a software engineer at effx. 

For Tom, the future of data management in Kubernetes is exciting because it enables developers to reclaim a ton of time, which can then be invested in other more important areas.

“Think about what typical management tasks are in sort of the old world and old data centers,” Tom explains. “Like a node goes down, that becomes a page and a manual activity to repair the cluster, to bring up a new host, to copy data over. In Kubernetes, in the cloud, using an operator that can manage a cluster, a lot of those tasks become completely automated and don't require any intervention of an engineer at all. That sort of capability is a pretty big leap forward.”
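Tom's example is exactly what the Kubernetes operator pattern automates. As a rough sketch of the idea (not any particular operator's actual logic), here is what that "detect a dead node and repair it" loop can look like in Go with client-go; the namespace, label selector, and timeout are illustrative:

```go
package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Runs in-cluster; outside a cluster you would build the config
	// from a kubeconfig file instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	for {
		repairStuckPods(cs)
		time.Sleep(30 * time.Second)
	}
}

// repairStuckPods deletes database pods that have been NotReady for too
// long; the StatefulSet controller then recreates them on a healthy node,
// replacing the page-and-fix-it-by-hand workflow Tom describes.
func repairStuckPods(cs *kubernetes.Clientset) {
	pods, err := cs.CoreV1().Pods("db").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "app=cassandra", // hypothetical namespace and label
	})
	if err != nil {
		log.Println("list failed:", err)
		return
	}
	for _, pod := range pods.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady &&
				cond.Status == corev1.ConditionFalse &&
				time.Since(cond.LastTransitionTime.Time) > 10*time.Minute {
				log.Println("recycling stuck pod:", pod.Name)
				_ = cs.CoreV1().Pods("db").Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
			}
		}
	}
}
```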

Getting started with data in Kubernetes

Unfortunately, you can’t just put data into Kubernetes and expect fantastic results. You need to set the stage first. Having been through the process multiple times, Mya knows a thing or two about preparing your environment ahead of data migration.

“I’ve had my cloud provider blow away my entire cluster,” Mya said. “So, getting backups someplace is probably the first big recommendation. You can’t get too attached to the data living in the same cluster indefinitely. You need to have it regularly backed up in S3 or some other bucket service.”
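The mechanics behind Mya's advice are deliberately mundane: a scheduled job that ships every snapshot out of the cluster to object storage. A minimal sketch in Go using the AWS SDK, where the bucket, key, and snapshot path are illustrative:

```go
package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// Uploads a database snapshot to S3 so the cluster itself stays disposable.
func main() {
	f, err := os.Open("/backups/cassandra-snapshot-2021-01-29.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess)

	out, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-db-backups"), // illustrative bucket name
		Key:    aws.String("cassandra/snapshot-2021-01-29.tar.gz"),
		Body:   f,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("backup stored at", out.Location)
}
```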

Regardless, figuring out how to effectively manage data in Kubernetes is “really, really challenging,” according to Sam. Luckily, there’s no shortage of talented devs working on this problem.

“I think over the next couple of years, we'll start to see data engineers, ML ops, data ops, all start to standardize in Kubernetes,” Sam continues. “They're going to ask, ‘What's a container for data? And how do we make this stuff an order of magnitude simpler?’ So we've got a little ways to go. But it is going to be about speed and consistency.”

Check out the full conversation to learn more about data management in Kubernetes and the kinds of tools you can expect to emerge in the coming years, as developers focus on simplifying the process and reducing toil for devs and operators alike.

Transcript:

Alex Williams (00:11):
Hello, everybody. It is once again, pancake breakfast time. Our topic today is, what is data management in the Kubernetes age? In other words, pass the syrup, Cassandra. It's pancakes time with Sam Ramji, chief strategy officer at DataStax. Sam, good to see you here again today.

Sam Ramji (00:40):
Good to see you, Alex. Long time, no pancake.

Alex Williams (00:43):
A long time, no pancake is right. Too long. And also joining us is Mya. Hello, Mya.

Mya Pitzeruse (00:53):
Hey, Alex. I'm happy to be here again.

Alex Williams (00:57):
Mya is a software engineer at effx. Thanks for joining us, Mya. And Tom Offermann, lead software engineer at New Relic. Hey, Tom.

Tom Offermann (01:12):
Hi, Alex. Thanks for having me.

Alex Williams (01:17):
We're here virtually, but it's always a good time for pancakes. Is it not? I brought my pancakes. Did everyone bring their pancakes today?

Sam Ramji (01:26):
I brought my syrup.

Alex Williams (01:28):
You brought your syrup, Sam? Can you pass me this syrup, please?

Sam Ramji (01:31):
Here you go, Alex.

Alex Williams (01:33):
Just hold on one second, Sam, because I need to get myself together here and I want to like... I just thought I join you in your room for pancakes, Sam. How do you like that?

Sam Ramji (01:45):
You stole my room, Alex.

Alex Williams (01:50):
Another pancake feat. This is pancakes as a service here. It may be like pancakes are the new osmosis, or like the new Star Trek kind of thing, or Star Wars spatula for that matter. We can go quite a ways with the pancakes. So thanks for the syrup. I'm going to go back to my room, I think, for a minute. I feel comfortable here, but I want to say, Tom, did you bring any syrup? Did you have any syrup here today?

Tom Offermann (02:24):
It would be great if someone could pass me the syrup. 

Alex Williams (02:27):
Okay. Yeah, sure.

Sam Ramji (02:28):
Pancakes are just a delivery vehicle for maple syrup, so here you go.

Tom Offermann (02:32):
Absolutely. You can never have too much. Thank you, Sam.

Alex Williams (02:36):
Mya, I want to get an idea of what a Kubernetes pancake looks like these days. Can you give us a glimpse of the Kubernetes pancake?

Mya Pitzeruse (02:47):
Happy to. I figured since we had Kubernetes on the docket today, I would sit down and take a look at it.

Tom Offermann (02:52):
We're not worthy.

Alex Williams (02:54):
We're not worthy. Do you have syrup? Do you need some syrup too?

Mya Pitzeruse (03:02):
I could use a little syrup.

Alex Williams (03:04):
Okay. Good. Well, let me pass you this syrup over here. How about that? Here you go.

Mya Pitzeruse (03:10):
Thank you.

Alex Williams (03:10):
Oh, you're welcome. You're welcome. Well, why don't we just take a quick bite of pancakes before we get into the discussion. Mine are vegan, gluten-free. I don't know about yours, but wow. That's a good pancake. My gosh. A robot did not make those pancakes, not this time.

Sam Ramji (03:31):
I missed the robot pancakes, but the robot never produced anything like Mya Pitzeruse's awesome Kubernetes pancakes.

Alex Williams (03:38):
No, no. The robot was not nearly that advanced. So our topic today is about data management in Kubernetes, and we have Sam, who brings that perspective from living in the land of Cassandra and living in the land of data management. We have Mya Pitzeruse. Mya is quite knowledgeable on the world of GRPC. I look forward to hearing about kind of the perspective there. And Tom, Tom, I know you all are big consumers at New Relic of Cassandra and Kafka too. So my question is, Sam, where are we right now? Where are we in this world of data management? I don't really think that there's any kind of concrete understanding that you get when you think about something that's so new, really in that perspective. We talk about data all the time, but there's lots more to it than just the data itself, isn't there?

Sam Ramji (04:45):
Yeah. Yeah. There's nothing new under the sun in our industry, right? We're still working on a Von Neumann architecture and building on Turing machines. All of that stuff is 70 years old. So they had data in the mainframe era. They had applications. They had network and everything. But what we do see is every decade or so, there's a simplification, a change that makes it a lot easier to use and consume whatever the resource is. So you see the 2000s and this era of scale-out networking.

Sam Ramji (05:14):
You got folks like Andy Bechtolsheim and others figuring out, how can you create an addressable network space that lets you talk to billions of devices. In the 2010s, then you start to see the emergence of Docker, standardizing containers, Linux, bringing a lot more Linux to bear, and then Kubernetes standardizing how you deal with Docker containers. So that's a decade of simplifying and standardizing how you do compute. So now, we're at the precipice of the 2020s and this feels like the decade of data. Now that we're building on top of those large billion-elements scaling systems, how do we make data fluid? How do we make it containerized? How do we make it Kubernetes native and cloud native? Those are the kinds of things at the edge of practice that I see right now.

Alex Williams (06:03):
Before we go any further, I want to introduce my co-host for the show, who will be asking all the questions. Let's give a round of applause for Joe Jackson. Hey, Joe. Joe, thanks for joining us. Joe has a few questions out there, but before we get started with the questions, Mya, I had been doing a little research and I saw that you did a presentation on GRPC, looking at it through kind of this perspective of data management, in concert with what Sam is talking about in terms of where we are right now. What is the interest in these new generations of machine-to-machine technologies, such as you see with GRPC?

Mya Pitzeruse (06:50):
I think a lot of it gets into, how do you efficiently move data around. A lot of the traditional systems have been bound to single machines, and with this kind of new era of compute, we need to start working on taking our data workloads and spreading them across machines as well, kind of getting back to Sam's points from just a second ago, where we've seen this nice new-age evolution. I think there have been a lot of lessons on the stateless side of the world that are starting to make their way down into the stateful side. As a result, we're seeing some really healthy growth in running these really complicated stateful systems inside of these more ephemeral compute platforms like Kubernetes or Nomad, or even the managed ones in the cloud.

Alex Williams (07:45):
Yeah. That's kind of the story it seems like, that we've really been almost trying not to talk about until most recently with the stateful environments. There's been a lot of work done with the interfaces, like the container storage interface and CNI. How has that impeded your work at New Relic, or how has it forced you to advance your thinking, Tom?

Tom Offermann (08:15):
Yeah, it's interesting. I'm totally in sync with Sam here. I think, looking at our evolution, both across the industry and within New Relic, Docker was such a huge leap forward because it kind of answered the question of, what does it look like when you build software? What's the artifact? We sort of standardized on Docker images as being the standard artifact. Now, I think with Kubernetes, we're developing what is the standard way to deploy and orchestrate and manage those Docker containers. I think that the early activity in Kubernetes was really around stateless services. And now, I think we're kind of at the point where we're ready to tackle, how do you manage stateful services? How do you manage databases? How do you manage data stores on Kubernetes? And that's pretty exciting.

Sam Ramji (09:11):
There's a really good reason for that. So if you look back at where Kubernetes came from, it was building on a lot of experience in containerization and large-scale infrastructure management at Google. So when you look back to Docker, I remember meeting Solomon Hykes in late 2009, when he had first moved to the Bay Area; he was tremendously excited about what had gone into Linux containers. If you start at the beginning of that provenance, it's 2007 when Google was contributing cgroups to the Linux kernel. So creating this isolated security envelope around these different elements starts to become a deployment unit, and that's awesome. But Borg, as the internal system of Google was called, and it still is, was only half of the story. The other half of the story was how Google engineers were able to access state that was really well and automatically managed, and they didn't have to think about it, because of the stateful services teams.

Sam Ramji (10:05):
When Google went to open source the future of Borg, this perspective of Borg for everyone, Kubernetes, there was no parallel strategy for what to do with state. The idea there was, as we increase the appetite for compute, people will use more and more cloudy stateful services, but you don't have to stuff that inside Kubernetes. Of course what's happened, and I had a bird's eye view because I was VP of product management at Google Cloud Platform at the time overseeing Kubernetes and cloud dev ops, we thought, "Well, we'll just take care of the compute and the data will take care of itself." But instead, as Mya said, that has pulled in a lot more orientation to, "Hey, why shouldn't data be as easy to use as the compute layer?" So bringing statefulness to the Kubernetes environment is kind of the big challenge and opportunity right now.

Mya Pitzeruse (10:57):
Yeah. I think one of the interesting points there is if you really look back, Vitess was one of the first stateful workloads to support running on top of Kubernetes, and a lot of it stems from how Vitess was developed at Google running on Borg and being able to leverage a lot of the existing semantics there, because Vitess wasn't really deployed any differently there. I remember hearing somebody talk about it where they were like, as soon as Kubernetes announced GA for some of the lower-level persistence components, Vitess was like, "Yeah, we support Kubernetes as a deployment platform." So it's just interesting history.

Sam Ramji (11:39):
Yeah, it is an interesting history. I think of where we are now. For instance, there's a lot of talk about GPUs, and that's about managing data really, really fast and being able to train the models to run really fast. I think back to when container technologies first emerged; we were talking a lot about that speed factor and the portability factor. But how were you managing data then compared to how you're managing it now, Mya? I'm curious, and I'd love Tom's thoughts too.

Mya Pitzeruse (12:20):
Tom, do you want to start?

Tom Offermann (12:25):
Yeah, sure. I think for me, the before and after is really very much, what does it look like to manage data in a traditional data center, versus what does it mean to manage it in the cloud using Kubernetes? There are just a lot more capabilities in cloud Kubernetes. There are more opportunities for automation that make our lives easier and make us better able to perform our jobs. I think about...

Tom Offermann (13:00):
I think about what typical management tasks are in sort of the old world and old data centers. Like a node goes down, that becomes a page and a manual activity to repair the cluster, to bring up a new host, to copy data over. In Kubernetes, in the cloud, using an operator that can manage a cluster, a lot of those tasks become completely automated and don't require any intervention of an engineer at all. That sort of capability is a pretty big leap forward.

Sam Ramji (13:44):
I think there's a question of whether you want to manage two worlds or if you want to manage one world, right? So to your question, Alex, if you don't manage state in a Kubernetes-native way, then you're kind of segregating it and you're piping it in from somewhere. One of the things that we did to bring Cloud Foundry and Kubernetes together was we took the Cloud Foundry Service Broker API, and we cleaned up the IP and contributed it to the community, and that's called the Open Service Broker API. But still, a service broker means I've got my data over here and it's carefully managed, sort of more like pets than cattle, by a set of folks who really know how to deal with the concerns of scaling out stateful services. And then you string a pipe and then you put that into your compute environment, and that's where your apps get at it. So that's kind of the old state of the art.

Sam Ramji (14:34):
The pressure that Tom is describing is, why do you want to have two worlds? Why can't we take those esoteric rules about the data for data's sake and move them into being recipes and Kubernetes operators and automation, so you have one compute plane, one control plane, and a data plane, frankly, that's application-aware, which is super hard to do if the two worlds are segregated.

Mya Pitzeruse (15:00):
Yeah, there's even coordination components from an app dev's perspective, too, where it's like, "Oh great, I'm deploying service X, but service X needs a database or [inaudible 00:02:11]." And it's like, how do you even go about deploying something like that alongside your application, if your application is running in Kubernetes, and then even gluing all of the secrets into place? And if you're living in this split world, one, your engineers now need to maintain both Terraform and YAML, and there's not a lot of thought put into the whole developer experience around it. And so getting to this world where everything's kind of just packaged up together and running in its own kind of isolated way makes a lot of things better. Not to mention when you start looking at how long it takes for a VM to boot up in comparison to how long it takes a container to start up, right? You have some, I don't know, carrots there, I guess.

Alex Williams (15:59):
So how does this affect the developer who's building out the microservices, for instance? They used to work with one database, right? Now you might have multiple data stores working with each service in itself. You have that layer of data store there that you have to be managing. So how does that affect the developer?

Sam Ramji (16:30):
Well, one thing you think about is a good architecture has a separation of concerns, right? So is every developer excellent at everything? Generally not. Even if you have a two-pizza team, is everybody that you have in that 10-person team able to select, install, manage, deploy, and operate the right two or three different data stores that you need? Maybe you need a NoSQL data store, maybe a SQL data store. Maybe you need a time-series database as well.

Sam Ramji (16:56):
These competencies then start to compost down and you start to build a data platform. One of the interesting little-known facts about Google is, when I was there in 2018, we had about 44,000 engineers, extremely competent and made more productive by the platforms we gave them. And yet only about 5% of the developers at Google were building stateful services. And if you were not one of the developers who built stateful services, and you decided you were just going to roll your own, let's be clear, you were going to be in deep trouble, because if you kind of do your data scaling and audit and security by the school of skinned knees and bruises, you can put your company at a lot of risk. So separating the competencies into a data platform team is kind of the pattern that we tend to see with these kinds of classes of conversations, getting out of the tyranny of Microservices 1.0, where you just say, team, go wild. And then you realize, "Oh my gosh, we've got data sprawl. We've got problems with our service level objectives. Our uptime's not where it needs to be, and we can't predict it." Then you start to say, "Well, it's time to get out of jail. Let's work with the platform engineers who can really sustain the kind of scale and observability and uptime that we need."

Alex Williams (18:11):
Before we go further into that discussion. We got Joe Jackson here. I think he has a question from someone out in pancake land.

Joe Jackson (18:21):
Yeah, in fact we did get a question from a viewer, and the question is, "What needs to happen before you put data into Kubernetes?" And they specified that the answers could be either people, process or technology focused. This one goes out to the crew.

Sam Ramji (18:44):
Who would like to step up with that one? Mya?

Mya Pitzeruse (18:51):
Because I had gone through this probably two or three times: I've had my cloud provider blow away my entire cluster. So getting backups someplace is probably the first big recommendation, because working through that workflow is, for most operators, pretty transparent these days. It's like adding some additional config to get it going. But when we start to talk about treating clusters more like cattle, less like pets, you can't get too attached to the data just living in that same cluster indefinitely. And so having that data regularly backed up, to S3 or some other bucket service, for later restoration. That can be used to spin up new clusters, that can be used to do complex migrations from one version to the next. You name it.

Sam Ramji (19:38):
So what are those stories? What are the stories you hear, and from your own experiences too, about how that data loss happens? In your case, was it something on the cloud service side? What did you learn from that? What are some of the kinds of causes of these data losses?

Mya Pitzeruse (20:03):
So the one that mattered for me was, the cloud provider I was on deployed a change to their container storage interface driver, and it broke permissions for, what is it, the copy of ext4 on the root of the drive, or something like that. And it didn't break for every persistent workload I had. It only broke for a small handful of them, which was even more frustrating. And so in the troubleshooting process, it wasn't clear exactly why this happened. And so when I went and started bouncing pods, they weren't able to load data. And then even once they did finally load data, the boot record, this was Postgres, the boot record for the PostgreSQL database was corrupted, and it wasn't able to open itself back up. And so having just that regular nightly backup, where I could have restored, would have been a huge lifesaver. But it doesn't fix the problem that the cloud provider deployed a bug to their storage interface, right? And it's like all software systems, right? Nothing is going to be a hundred percent bug free. We can always just hope for the best. And so finding ways to think around those problematic cases is major.

Sam Ramji (21:19):
Tom, you must have some war stories.

Tom Offermann (21:22):
Yeah. Well, I was just thinking about the question. I think there is a lot that we could take advantage of building and deploying on top of Kubernetes, but we should recognize that just having a Kubernetes platform available to us requires quite a bit of work. And that means either you have to have dedicated people building and managing that yourself, or you need to have a cloud provider and a managed Kubernetes service that you can trust. So, yeah, it is complex if you're doing it yourself; there's a lot to manage. So there is a layer that needs to be in place, a foundation, before you're able to take advantage of some of the goodness of managing stateful services on top of it.

Alex Williams (22:12):
So you all must have seen backup change quite a bit then over the past few years. Backup used to mean something quite different. When I first started writing about enterprise technologies, back circa 2008, 2009. I expect that you learned a lot about backups when you were starting to work more in microservice environments.

Mya Pitzeruse (22:41):
I'm sorry. Can you please repeat the question?

Alex Williams (22:44):
The backup question, the question about backups and how backing up data has changed, and what are some of the things that you've noticed in how you have to think about data backup now? I mean, you talked about a cloud service, right, where the interfaces didn't work so well, and so now you have to think about the backups. But backups, I think, are treated a lot differently now than they were even before the era of cloud, which wasn't so long ago.

Mya Pitzeruse (23:14):
I think there's been a bigger emphasis on testing your backups these days. I know that we had a small effort kind of going through, whenever a snapshot was taken of a database, we would go through and find a way to checksum it and make sure that the snapshots we were taking were at least a little bit more consistent than what we were initially expecting. The kind of big thing there was, you could trust that your database is snapshotting properly and just say, yeah, the database snapshot should be fine, or you can always just verify it, right? But that's the way that I've seen it change. I don't think it's really changed too much. There is more of a concept, I think, of streaming backups versus doing the periodic one-time snapshots. And that's definitely for the larger-scale stuff that we've been working on.
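The checksum idea Mya mentions needs very little machinery; the point is that a backup you have never verified is a hope, not a backup. A small Go sketch that recomputes a snapshot's SHA-256 and compares it against the digest recorded at backup time (paths are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"log"
	"os"
	"strings"
)

// Verifies a snapshot against the digest recorded when the backup was
// taken, so corruption is caught long before a restore is attempted.
func main() {
	raw, err := os.ReadFile("/backups/snapshot.tar.gz.sha256")
	if err != nil {
		log.Fatal(err)
	}
	want := strings.TrimSpace(string(raw))

	f, err := os.Open("/backups/snapshot.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	got := hex.EncodeToString(h.Sum(nil))

	if got != want {
		log.Fatalf("snapshot corrupt: got %s, want %s", got, want)
	}
	log.Println("snapshot verified:", got)
}
```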

Sam Ramji (24:12):
There's a chat with the folks from MayaData, who are the maintainers of OpenEBS, and Kasten.io, which was recently acquired by Veeam, and the folks that are [inaudible 00:11:21]. And they're all focusing on Kubernetes-native data management, including backup. Also Eric Hahn, who's now at NetApp. And of course you saw Portworx, acquired by Pure Storage recently. And I think that the change that you're going to see, that you're already seeing, is radical application awareness as defined by Helm charts. If you can't read the Helm chart, then you don't understand the exploded application topology that Kubernetes is bringing, and then you can't really fully understand what you want to back up and then how to restore it, because all of the cardinality changes, right? How many of this for how many of that? What actual nodes are these different processes running on? What data access do they need when you're in a restore situation? So I found it quite fascinating to talk to those folks as well about the integration that they've had to do with Helm charts and application architecture awareness in order to execute the backup and the restore process correctly.

Alex Williams (25:22):
Well, great. We're getting a little bit into the weeds here. I'd love to bring it back to some of the main points about how we think about data management. I would love to just get a perspective on how you think it's going to change over the next few years. Tom, are there any glimmers of what changes you see coming, for instance, the need for new tools or new capabilities with the operators, or whatever it might be?

Tom Offermann (25:57):
Yeah. I mean, I think we're really at the beginning of establishing this operator pattern.

Tom Offermann (26:00):
This operator pattern for managing data stores on Kubernetes. And those operators are only going to get more and more capable. I think right now, they're very good at deploying a cluster, scaling it, recovering if a node goes down like I was describing. But what's exciting, at least in the Cassandra community, is all this activity around the next set of things that every operator needs, like having backups be automated and built in as part of the operator, metrics, those kinds of things. That pattern is being established and we're only going to continue to add more and more capabilities to it.
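The pattern Tom is describing reduces to a reconcile loop: observe actual state, compare it with the declared spec, converge. A skeletal, hedged sketch using controller-runtime; the reconciler name and the day-two actions are placeholders, not any real operator's code:

```go
package controllers

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DataStoreReconciler converges a database's StatefulSet toward its
// declared state: create it on day one, repair drift on day two.
type DataStoreReconciler struct {
	client.Client
}

func (r *DataStoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var sts appsv1.StatefulSet
	if err := r.Get(ctx, req.NamespacedName, &sts); err != nil {
		// Day one would create the cluster here; NotFound is not an error.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Day two: drift between declared and observed state triggers repair.
	if sts.Spec.Replicas != nil && sts.Status.ReadyReplicas < *sts.Spec.Replicas {
		// A real operator would replace failed nodes, rebalance, run
		// repair, take backups, emit metrics. Here we only requeue.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}

func (r *DataStoreReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.StatefulSet{}).
		Complete(r)
}
```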

Alex Williams (26:44):
How's that pattern emerging for you, Sam?

Sam Ramji (26:47):
We've learned that we've had to put a lot more day-two capabilities in our operator. So as the folks from CoreOS, who first created the operator pattern, would say, there's a big difference between enablement, kind of the day one, like, "Look, I can run this database on Kubernetes," and day-two operations, like, "Will it stay up for days and weeks and months and years?" So as both Tom and Mya have mentioned, backup and restore, including a lot more different components around the database, is important. So one of the things that we found is, we've run large-scale Cassandra, multi-tenant and single tenant, in all the major clouds. We've built a service called Astra, which is basically large-scale, cloud-based Cassandra. It's all on Kubernetes. And we found that we had this kind of ecosystem of other projects that we've had to pull on and build in order to make it run properly.

Sam Ramji (27:45):
So packaging all those things into a distribution, which we call K8ssandra, K 8 S S A N D R A, includes things like Medusa and Reaper and metrics collectors, as well as signals that can be raised to Kubernetes and beyond, getting out of this idea that the database itself is a monolith, and kind of inverting it a little bit, making the cluster management a little bit more obvious, waiting for Kubernetes to pass you control again, and having all the affordances that you need around the kernel of the database, in order to actually scale the operator pattern. So the operator is great, but not quite enough. You really need a distribution of operator helpers that can make the database work well in a Kubernetes control plane.

Alex Williams (28:32):
Mya, I'd love to hear your perspective, and then maybe we can go to Joe for a question.

Mya Pitzeruse (28:38):
No, I completely echo that. You see a lot of it in some of the well-supported operators today. I've gone from the Presslabs MySQL operator to the Zalando Postgres operator, and most of them come with some kind of sidecar or co-process that's able to help with the database administration and cutover processes.

Alex Williams (28:59):
Sidecar is a good way to think of it, as a co-process, isn't it?

Mya Pitzeruse (29:06):
Yeah. Yeah. The whole orchestrator stack and... Gosh, what is it? There's a set of tools all around MySQL, like the Percona setup, all of that. It's all typically done as a sidecar. And kind of getting back, I think, to one of your earlier questions around how that used to get managed versus how it gets managed in the Kubernetes world, it's like running those sidecars on the old systems was a pain.

Alex Williams (29:36):
Well, good. Well, Joe, what kind of question do we have from the crowd out there in pancake land?

Joe Jackson (29:43):
All right. Well, someone is very curious about Astra serverless, and how that works with GRPC and data streaming. So maybe, Sam, you could talk about how these things are interrelated?

Sam Ramji (30:01):
Sure. I'll do my best. So serverless is partly about how you get the unit economics of a service down. Something that we hear a lot from cloud users, as well as people who are building on-premises environments that need to have cloud-like economics: "Do you have to commit to one server? Can you just run as much workload as you want? And will it auto-scale behind the API?" And I think that's also the same question for the idea of GRPC. So one of the things that you'll find in distributed systems architecture is gateways are really helpful. So as you start to look at how do we manage all these services, how do we get the right service model, you see Matt Klein and the Lyft folks open-sourcing Envoy, and Envoy, all of a sudden, becomes kind of a service proxy for most of the microservices that you run on Kubernetes. And that gives you a really nice place to attach into any of the service meshes that you care about, whether it's Kuma or Istio or anything that follows SMI, some of Idit Levine's work at Solo.io.

Sam Ramji (31:06):
So this idea of a gateway, a service control plane, a service mesh, all of these things connect in by having a really good service proxy with Envoy. So being able to start thinking about a data proxy, enabling people to build data meshes, is really important. Just because you wrote the data in a particular way doesn't mean you should always have to address it the same way. So we're seeing people building things out, we're seeing this in Netflix, we're seeing this in Apple, and we've copied the pattern and released an open source project called Stargate, where you should be able to point it at your existing data store and then say, "Hey, you know what? I want a JSON view of this that feels more like a document. I want a REST view of this. I want a GraphQL view of this." And most importantly, for the high-scale services, a GRPC view. And that's something that Mya is far more expert on than me to explain why you would want to be able to have these personalities, particularly GRPC, for the kinds of application affordances and operator sanity, perhaps, of being able to scale effectively.
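To make the "personalities" idea concrete: once a gateway like Stargate fronts the database, reading a row becomes an HTTP call rather than a driver integration. A hedged sketch in Go against the shape of Stargate's REST endpoint of the time; the host, port, keyspace, table, and token are all illustrative:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// Reads a Cassandra row through Stargate's REST personality. The same
// data could equally be fetched via the GraphQL or Document (JSON)
// endpoints without touching the underlying schema.
func main() {
	req, err := http.NewRequest("GET",
		"https://stargate.example.com:8082/v2/keyspaces/shop/orders/42", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Stargate authenticates requests with a Cassandra auth token.
	req.Header.Set("X-Cassandra-Token", "token-goes-here")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON rows, no SDK or driver required
}
```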

Joe Jackson (32:12):
Now, Mya, I understand that there's some work going on with GRPC around, I guess, in effect, using it to perform load balancing and maybe API gateway functions. What's going on there?

Mya Pitzeruse (32:30):
So GRPC's had load balancing in, I think, from day zero or something like that. It's always been able to load balance across its backend instances. The big thing, I think, that's been changing more recently has been its tighter integration with the xDS APIs. So, hitting back to that service mesh: one of the core components in a service mesh is the actual coordination APIs that are used to direct traffic from instance A to instance B. And so when you talk about a service mesh, most people talk about the proxy-based mesh, where an application talks to a local proxy, which then handles the mTLS, the load balancing, and so on and so forth, to the backend instances, which then have a corresponding proxy on their side.

Mya Pitzeruse (33:24):
As you can imagine, there are a lot of workloads that this adds too much latency for, in particular persistent workloads, where, if you had to jump through those hoops every time, it's going to become a hassle. The benefit of GRPC is being able to direct all of that without necessarily needing those proxies in the mix. And they announced that... Go ahead.

Joe Jackson (33:46):
Oh, I'm sorry. So it could potentially replace a service mesh at some point.

Mya Pitzeruse (33:46):
Yeah. That's one of the big selling points. I don't know if I would say replace the service mesh, because the service mesh is very complicated. We talk about it in probably two key planes, the data plane and the control plane. And it doesn't replace the control plane component; it only replaces the data plane component of that actual structure.
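In grpc-go terms, the proxyless setup Mya describes is mostly a question of how you dial. A hedged sketch; the target names are illustrative, and the xds resolver assumes a control plane (the mesh's coordination APIs) is serving it configuration:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	_ "google.golang.org/grpc/xds" // registers the xds:// resolver
)

func main() {
	// Proxyless mesh: the xds resolver pulls endpoints, routes, and
	// load-balancing policy straight from the control plane, so no
	// sidecar proxy sits in the data path on either end.
	meshConn, err := grpc.Dial("xds:///orders.default.svc:50051", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer meshConn.Close()

	// Classic client-side load balancing, in gRPC from the early days:
	// resolve backends via DNS and round-robin across them in-process.
	lbConn, err := grpc.Dial("dns:///orders.default.svc:50051",
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"round_robin":{}}]}`))
	if err != nil {
		log.Fatal(err)
	}
	defer lbConn.Close()
}
```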

Alex Williams (34:14):
It's a lot about performance, isn't it, Sam and Tom? Tom, performance has to be a major consideration. GRPC and Cassandra have a role in optimizing performance, but what are the other performance issues that you see cropping up?

Tom Offermann (34:35):
Yeah. So performance is definitely speed, but also the related ability to handle high volumes of data. Those are the things that we care about; that's at the top of our list. How fast can we write data is the thing we care about the most, and something like Cassandra works really well for us. In terms of Cassandra and GRPC, for our use case, we don't have a world where many, many services are talking to our Cassandra cluster. All of our writes are coming through one service. And again, it's designed for high throughput: write as fast as you can. And our queries on the read side are kind of similar. But yeah, I agree that performance is the number one concern when we're looking at a data store.

Sam Ramji (35:28):
And the useful thing about coupling something like GRPC with Cassandra is that you think about the raw throughput that you're looking for from a database, and then you match that against traditional binding approaches, and you go, "Well, how much time am I spending marshaling and un-marshaling the data?" And then you think about how much developer time you're spending building out SDKs for each language to make the data endpoint consumable. GRPC, kind of by design, auto-generates stubs. It lowers the barrier of access. It also lowers the expectations for how you are going to talk to the database. So there's a nice harmony between GRPC and Cassandra, for sure. But Cassandra is not the only database that is going to benefit from GRPC. Tom uses Kafka, and of course the New Relic database, the internal NRDB, which is a pretty darn strong time-series database.

Tom Offermann (36:27):
Yeah, absolutely. We store many different types of telemetry data, and we try to choose the right data store for each different data type, and make sure that it's appropriate for the given shape of the data.

Joe Jackson (36:51):
Today, at AWS re:Invent, Werner Vogels, in his keynote, had an interesting talk about bringing strong consistency to S3. Evidently they can do strong consistency now, whereas before it was eventual consistency: you didn't get the info about the data you placed in the S3 bucket right away. And in general, he said that consistency in distributed systems is a really hard problem; they worked a lot of years on it. And I'm kind of curious as to, is this true? I know we're talking about performance, but is consistency a big stumbling block for using Kubernetes or any distributed system?

Alex Williams (37:38):
Mya, you have any thoughts on that?

Mya Pitzeruse (37:43):
Anytime you talk about a distributed system, you're going to hit issues dealing with CAP. So the CAP theorem, consistency... Oh gosh, I always butcher this one. I haven't looked at it in a while and I always feel like a fool.

Sam Ramji (37:56):
Oh, yeah. Consistency, availability...

Mya Pitzeruse (37:59):
Consistency, availability. And it's like, which one do you choose in the event of a network partition? Because the network partition is inevitable. So how do you favor that? Most storage systems that we talk about today favor consistency over availability, and then you typically pair that with an AP system, like your auto-scaling service. And that's how you're able to balance those trade-offs.

Sam Ramji (38:26):
Yeah, that's actually a great focusing point on Cassandra, because the database itself was invented 12 years ago, when Facebook was trying to figure out how to do Facebook Inbox. They had this really big problem with availability as they were trying to compete with basically Microsoft Outlook for email. So you can imagine how hard it is. Get in the wayback machine: it's 2007, 2008, the iPhone has just come out, and the world's standard is you're using offline email clients, particularly Outlook tied to Exchange, which does caching. So how do you compete? Well, you have to be hyper, hyper available.

Sam Ramji (39:00):
So, how do you compete? Well, you have to be hyper, hyper available. So they took the AWS Dynamo whitepaper and said, "How would we build a coordination layer that looks like Dynamo?" And then they took the Google Bigtable whitepaper and said, "How could we have this write-privileged system, where writes are equally as fast as reads?" And by taking those ideas and algorithms together, they ended up creating Cassandra, which is pretty interesting.

Sam Ramji (39:26):
One of the things that you'll find is any distributed system does have consistency issues. So what do you do about it? In fact, there are tools in the Cassandra ecosystem that focus on what's called the repair process. Repair is actually an abbreviation, and I'll stop here, for anti-entropy consistency and repair. So realizing that entropy is a thing that happens in distributed systems, you have to have a process that runs every so often to say, "Is everybody okay here? All the 12 or 20 or 200 nodes in the cluster? Oh, we don't got this? Let me go fix that. Let's get the entropy out."
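The anti-entropy idea is simple even though production repair is not: each replica summarizes a key range with a digest, and disagreement flags the range for syncing. Cassandra's repair actually exchanges Merkle trees over token ranges; the toy Go sketch below collapses that to a single hash per replica just to show the principle:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// digest summarizes a replica's rows as one hash, computed over keys in
// deterministic order so identical data always yields identical digests.
func digest(rows map[string]string) [32]byte {
	keys := make([]string, 0, len(rows))
	for k := range rows {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(rows[k]))
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	replicaA := map[string]string{"user:1": "alice", "user:2": "bob"}
	replicaB := map[string]string{"user:1": "alice", "user:2": "bobby"} // entropy!

	if digest(replicaA) != digest(replicaB) {
		fmt.Println("range out of sync: stream the divergent rows to repair")
	} else {
		fmt.Println("replicas consistent")
	}
}
```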

Alex Williams (40:01):
When entropy happens, what do you do? I think there's a question there for the ages. And I'm also curious about how this evolves, when we start thinking about the evolution of machine learning, which is really becoming quite a big part of how we think about distributed systems and distributed architectures. And we're just seeing it, from The New Stack, just seeping in everywhere. And so I'm curious on how this discussion really reflects on that. Who wants to take a shot? Tom?

Tom Offermann (40:44):
Well, not coming from a machine learning background myself, I think one of the big things about it is just the volume of data that you have to manage, right? Any sort of machine learning experiments or modeling that you're doing requires having a huge volume of training data to run on. And so really the question is, what makes it easier to manage high volumes of data? Right?

Tom Offermann (41:16):
And I think that's where tools, databases like Cassandra, and being able to manage those in cloud and Kubernetes, come in: they allow you to scale out your data store so that you can run that machine learning on top of it.

Mya Pitzeruse (41:33):
I think a big thing is managing your resources efficiently, right? If you look at the history of machine learning, they started on the MapReduce side of the world, where it's like, "We can't keep all of this data in memory. So we have to map it out to all the worker nodes, and then reduce it down into something that we can actually understand."

Mya Pitzeruse (41:50):
With things like stream-based APIs inside of GRPC, you can actually get more efficient access to your data, both at the database layer and during processing. And it helps keep your resource footprint lower. It helps you do a lot more things, to that point of efficiency. But you can also manage the volume of data significantly better.
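A sketch of the stream-based access Mya describes, using gRPC server streaming in Go. DataServiceClient, Scan, and the message types stand in for hypothetical generated stubs; the shape of the consume loop is the point, since the client handles one row at a time instead of materializing the whole result set:

```go
package main

import (
	"context"
	"io"

	pb "example.com/gen/datapb" // hypothetical generated gRPC stubs
)

// streamRows consumes a large scan as a server stream: constant memory
// on the client, backpressure via gRPC flow control on the server.
func streamRows(ctx context.Context, client pb.DataServiceClient) error {
	stream, err := client.Scan(ctx, &pb.ScanRequest{Table: "events"})
	if err != nil {
		return err
	}
	for {
		row, err := stream.Recv()
		if err == io.EOF {
			return nil // stream drained cleanly
		}
		if err != nil {
			return err
		}
		process(row) // handle one row at a time
	}
}

func process(row *pb.Row) {
	// ... feed the row into training, aggregation, and so on.
}
```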

Sam Ramji (42:14):
Yeah. And that's a great sort of interface through which to look at it. Can you feed the GRPC front end, right? Can you feed the GRPC API at the speed that the clients can consume it? So that speaks to the data stack. It also speaks to the Kubernetes wave that's kind of sweeping through the world.

Sam Ramji (42:33):
One of the first products that I had the privilege of launching at Google was the tensor processing units, which are all about ML. So, the question, why would you use a TPU? The same answer as why would you use a GPU, which is, I want my model to get trained before the universe dies of heat death. So how can I just go and grab a whole bunch of specialized resources, GPUs, whatever, and get my training time down from maybe 200 hours to 20 hours, or to 10 hours? Now, the limiting factor there may be your data throughput. For a very basic machine learning model learning a whole bunch of images, we've seen cases where typically you're going to want six gigabytes per second, maybe 10 gigabytes per second of throughput. That's actually fairly high, as raw throughput goes, but it's appropriate for being able to get your machine learning model trained fast enough.

Sam Ramji (43:25):
You combine that with geospatial multi-dimensional data that you're all trying to train on at the same time, and you really want high throughput, so that you can train your model fast enough. And the thing to think about here is this is now a flow process. So some of you may be familiar with Kubeflow, the ability to put your TensorFlow on your Kubernetes clusters. Again, that's all about being able to iterate, and have kind of a CI/CD approach to machine learning.

Sam Ramji (43:48):
You've got a model in production, but you want to update your model, just like you update code. How do you iterate the improvements in your model? How do you learn those? How do you do the training phase? How do you deploy that back to production? And how does that system end up pulling on similar classes and speeds of data that it was trained on? And this is where we started to try to harmonize the operational and the analytical systems, so that we're not doing this crazy data lake export of all the operational data, training your models on something that you're never going to see in production. And then hoping, on a wing and a prayer, that somehow it's going to make good decisions as it powers your API.

Sam Ramji (44:24):
So all that stuff is really, really challenging. I think over the next couple of years, we'll start to see data engineers, ML ops, data ops, all start to standardize in Kubernetes. They're going to ask, "What's a container for data? And how do we make this stuff an order of magnitude simpler?" So we've got a little ways to go. But it is going to be about speed and consistency.

Joe Jackson (44:46):
Hey, I wanted to ask Sam about Stargate, which DataStax just released. We've been getting a buzz about it at The New Stack. And the company describes it as the first open source API data gateway for modern data applications. And I believe you can do GraphQL, you can do JSON. Can you explain what it is, and the value there?

Sam Ramji (45:10):
Yeah, basically, it's an attempt to create an open source project that could talk to any backend. The backend, we understand, of course, is Cassandra, right? It's right here on the shirt. But it should be able to be pointed at any backend. And it's a pattern, as I mentioned before, that we've seen at Netflix, we've seen at Apple, where you have a range of different application development communities. They have different APIs that they want to be able to talk, or API styles that they want to talk to the data through. How do they get there?

Sam Ramji (45:36):
So what we've learned in computer science is any problem can be solved by adding one level of indirection, right? That's how virtualization works. So you can think of that as, what does a service proxy do? What does a data proxy do? Stargate's intent is to be a proxy, or a grown-up proxy like a gateway, where you can modularly add new personalities, right? So you could do a WASM or a Rust sort of module, to say, "Okay, here's the underlying protocol that I use to talk to the data store. Here's how I want to render it. And then let me run it for every request corresponding to the same security model. And can I run it at line rate?"

Sam Ramji (46:15):
So we're pretty stoked to see the uptake so far, right? It's more important to land and be useful than it is to launch, right? You don't want to crash and explode on the landing, no matter how pretty the launch was. So it's our hope that a lot of people will grab it, that it will be a multi-party open source project that helps people with the developer affordances to data that's being hosted in Kubernetes.

Alex Williams (46:45):
Do you see something of value there, either for you, Tom, or for Mya?

Tom Offermann (46:53):
Yeah. I mean, I think that's right. I like the emphasis on making it easier for developers to access the data. And there are different ways of doing that. I think, internally, we've solved the problem by funneling all requests, reads or writes, through a certain set of services that sit in front of our Cassandra clusters. And other services that need to access that data all funnel through those services. Right? So we've worked out a solution that makes sense for us.

Tom Offermann (47:29):
But I think what we're talking about here is building a sort of more open source solution to that problem, which is lots of developers want to access the data, but they want to do it in different ways. And how can we make that possible?

Alex Williams (47:45):
Mya, any last thoughts before we go?

Mya Pitzeruse (47:47):
I can't tell you how many times in the last two weeks I've turned to our CEO and said, "I really wish we could write our front end API in GraphQL." I've worked on GraphQL for five years, and I've only really ever done the backend side of it. And I've never been big on the querying side of it.

Mya Pitzeruse (48:10):
But when I started to get into effx, and started to look around at a lot of the product analytics, and how a lot of that was being driven currently, a lot of it was going through the existing GraphQL layer, instead of going to the database. And every time I was trying to put together some kind of query, I'm like, "Gosh, I just really wish I could write SQL."

Mya Pitzeruse (48:27):
And so the kind of benefit here is, we don't really care how you want your data. You tell us how you want your data, and we will give it to you. We just have to take care of retrieving it from the backend. And that's one of the simple things here that's really appealing, where now that debate is over. Do we write a GraphQL endpoint? Or do we write a GRPC endpoint? It's like, "No, here's something that gives you it all for free. Go."

Sam Ramji (48:51):
Yeah. Right? We should all be happy.

Mya Pitzeruse (48:54):
We should.

Sam Ramji (48:56):
It's all about reducing developer and operator toil. Right? So anything that makes people's lives easier ends up getting used. And the pattern of writing a microservice in front of Cassandra, so that you can render the data the way that it probably should be rendered automatically, doesn't seem like a great use of people's irreplaceable heartbeats.

Sam Ramji (49:18):
So if you can have something that gets it 90% right, for these different modalities, these different ways of talking about data, then just let the robots do the work. Right? And then let the humans do what we do best, which is talk to each other and imagine.

Alex Williams (49:38):
I think that's a great way ... I'm sorry, what was that?

Mya Pitzeruse (49:38):
I was just going to say, it even just pushes it closer to that kind of Google feeling, where storage is kind of solved. Right? You're not really working directly with the storage layer. If you want a manifested view or a manifested interface on top of it, you go and build it, in very similar ways. Right?

Alex Williams (49:55):
Well, I like how this is ending. We should all be happy and let the robots do the work. I think it's a good way to end the year here. This is being published in January. It seems so far away right now. But thank you so much. It's a great way to end our year, with a pancake breakfast with some of the smartest people that we've talked to, here at The New Stack.

Alex Williams (50:23):
So thank you so much, Sam Ramji of DataStax. Mya, your perspectives are excellent. We're so happy to have you join us for a second breakfast here. Mya [inaudible 00:50:35] is a software engineer at effx. And Tom Offermann. Tom, thanks for your perspectives at New Relic. And, yeah, let's give everyone a hand here. Way to go. Way to go. Thank you, Joe. Thanks, everyone. May the pancakes be with you.

