Company | January 27, 2021

Why Apache Pulsar as a Service is Essential to the Modern Data Stack

Jonathan Ellis

Messaging has been on DataStax’s radar for several years. A significant motivator is the increasing popularity of microservice-based architectures. Briefly, microservice architectures use a message bus to decouple communication between services, which simplifies replay and error handling and helps absorb load spikes.

Message bus in microservice architecture 

With Cassandra and Astra, developers and architects have a database ecosystem that is:

  1. Based on open source
  2. Well suited for hybrid- and multi-cloud deployments
  3. Available in a cloud-native, consumption-priced service

No current messaging solution satisfies all three of these requirements, so we’re building one.

Apache Pulsar vs. Apache Kafka: Strengths and weaknesses

We started by evaluating the most popular option, Apache Kafka. We found that it came up short in four areas:

  1. Geo-replication
  2. Scaling
  3. Multitenancy
  4. Queuing

Apache Pulsar solves all of these problems to our satisfaction. 

 

| | Apache Kafka | Apache Pulsar |
|---|---|---|
| Geo-replication | Single region. No support for cross-datacenter replication. Increased latency for clients outside the region where Kafka is deployed. | Geo-replication is built into the core server. Producers can write to a shared topic from any region. |
| Scaling | All segment files in a partition are required for replication in a new node. For this reason, adding capacity slows the cluster down, before it becomes faster. | Pulsar’s ledgers can be replicated independently of one another. A new storage node can be added simply by adding a new ledger. |
| Multitenancy | Single-tenant design | Multitenancy built-in at the core. Manage multiple tenants across multiple regions from a single interface. |
| Queuing | Does not support queuing. Uses a pub/sub messaging model where messages cannot be acknowledged out-of-order. So, a subscription cannot be shared across multiple consumers. | Supports pub/sub and queuing models. With queuing, processing order is not important and costs can be reduced as messages are load balanced across an arbitrary number of consumers. |

Let’s look at each of these in more detail.

Geo-replication

Cassandra supports synchronous and asynchronous replication within or across datacenters. (Most often, Cassandra is configured for synchronous replication within a region, and asynchronous replication across regions.) This allows Cassandra users like Netflix to serve customers everywhere with local latency, to comply with data sovereignty regulations, and to survive infrastructure failures. (When AWS rebooted 218 Cassandra nodes to patch a security vulnerability, “Netflix experienced 0 downtime.”)

The Kafka approach

Kafka is designed to run in a single region and does not support cross-datacenter replication. Clients outside the region where Kafka is deployed must simply tolerate the increased latency. There are several projects that attempt to add cross-datacenter replication to Kafka at the client level, but these are necessarily difficult to operate and prone to failure.

The Pulsar approach

Like Cassandra, Pulsar builds geo-replication into the core server. (Also like Cassandra, you can choose to deploy this in a synchronous or asynchronous configuration, and you can configure replication by topic.) Producers can write to a shared topic from any region, and Pulsar takes care of ensuring those messages are visible to consumers everywhere.


Pulsar’s geo-replication enables producers to write and consumers to read topics from anywhere. 
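The asynchronous mode described above can be sketched with a small in-memory model. This is only an illustration, not Pulsar's API: the `GeoReplicatedTopic` class, its methods, and the region names are all hypothetical. It shows a producer writing locally in any region while copies are shipped to the other regions in a separate step.

```python
from collections import deque


class GeoReplicatedTopic:
    """Toy model of one topic replicated across regions (not the Pulsar API)."""

    def __init__(self, regions):
        self.regions = list(regions)
        # Each region keeps its own local copy of the topic.
        self.local_log = {r: [] for r in self.regions}
        # Messages waiting to be shipped to each region from elsewhere.
        self.pending = {r: deque() for r in self.regions}

    def publish(self, region, payload):
        # The write is accepted in the producer's own region right away.
        self.local_log[region].append(payload)
        # Copies are queued for every other region.
        for remote in self.regions:
            if remote != region:
                self.pending[remote].append(payload)

    def replicate(self):
        # Asynchronous step: drain the queues into the remote logs.
        for region, queue in self.pending.items():
            while queue:
                self.local_log[region].append(queue.popleft())

    def read(self, region):
        return list(self.local_log[region])


topic = GeoReplicatedTopic(["us-east", "eu-west", "ap-south"])
topic.publish("us-east", "order-1")
topic.publish("eu-west", "order-2")
before = topic.read("ap-south")          # replication has not run yet
topic.replicate()
after = sorted(topic.read("ap-south"))   # both messages are now visible
```

In a synchronous configuration the publish call would instead wait for the remote copies before returning, trading write latency for stronger guarantees.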

Splunk published a good two-part overview of Pulsar geo-replication.

Scaling

It’s becoming increasingly difficult to predict data capacity needs. Ideally, you’ll be prepared with a flexible solution that can quickly and seamlessly scale to meet any data volume spike, no matter how large.

The Kafka approach

In Kafka, the unit of storage is a segment file, but the unit of replication is all the segment files in a partition. Each partition is owned by a single leader broker, which replicates to several followers. So when you need to add capacity to your Kafka cluster, some partitions need to be copied to the new node before it can participate in reducing the load on the existing nodes.

With Kafka, all segment files in a partition are required for replication in a new node. 

This means that adding capacity to a Kafka cluster makes it slower before it makes it faster. If your capacity planning is on point, this is fine, but if business needs change faster than you expected, it could be a serious problem.
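A rough way to see the cost is to add up what a new node has to fetch. The sketch below is a back-of-the-envelope model, not Kafka tooling: the function name, the partition names, and the segment sizes are made up for illustration.

```python
def bytes_to_copy(partitions, moved):
    """Bytes a new broker must fetch before it can serve the moved partitions.

    partitions: dict of partition name -> list of segment sizes in bytes
    moved: names of the partitions reassigned to the new broker
    """
    # The whole partition is the unit of replication, so every segment counts.
    return sum(sum(partitions[name]) for name in moved)


GB = 1024 ** 3
# Hypothetical cluster: three partitions made of 1 GB segments.
partitions = {
    "orders-0": [1 * GB] * 40,
    "orders-1": [1 * GB] * 35,
    "orders-2": [1 * GB] * 25,
}
# Moving a single partition means copying all 40 of its segments.
copy_cost = bytes_to_copy(partitions, ["orders-0"])
```

That copy traffic competes with normal produce and consume traffic on the existing brokers, which is why the cluster gets slower first.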

The Pulsar approach

Pulsar adds a layer of indirection. (Pulsar also splits apart compute and storage, which are managed by the broker and the bookie, respectively, but the important part here is how Pulsar, via BookKeeper, increases the granularity of replication.) In Pulsar, partitions are split up into ledgers, but unlike Kafka segments, ledgers can be replicated independently of one another. Pulsar keeps a map of which ledgers belong to a partition in ZooKeeper. So when we add a new storage node to the cluster, all we have to do is start a new ledger on that node. Existing data can stay where it is, and no extra work needs to be done by the cluster.

With Pulsar, ledgers can be added to storage nodes individually vs. replicating a whole partition.
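The ledger idea can be sketched as a partition that is just an ordered list of ledgers, each recorded against a storage node. This is a simplified toy, not the BookKeeper API: the `Partition` class and the node names are hypothetical, and it assumes each ledger lives on a single node, whereas a real ledger is written to several bookies.

```python
class Partition:
    """Toy model of a partition as an ordered list of ledgers."""

    def __init__(self, storage_nodes, entries_per_ledger):
        self.storage_nodes = list(storage_nodes)
        self.entries_per_ledger = entries_per_ledger
        # The ledger map: which ledgers make up the partition, and where each lives.
        self.ledgers = []
        self._open_ledger()

    def _least_loaded_node(self):
        load = {node: 0 for node in self.storage_nodes}
        for ledger in self.ledgers:
            load[ledger["node"]] += len(ledger["entries"])
        return min(self.storage_nodes, key=lambda node: load[node])

    def _open_ledger(self):
        self.ledgers.append({"node": self._least_loaded_node(), "entries": []})

    def append(self, entry):
        if len(self.ledgers[-1]["entries"]) >= self.entries_per_ledger:
            self._open_ledger()  # roll over to a fresh ledger
        self.ledgers[-1]["entries"].append(entry)

    def add_storage_node(self, node):
        # Only a new ledger lands on the new node; existing ledgers stay put.
        self.storage_nodes.append(node)
        self._open_ledger()


p = Partition(["bookie-1", "bookie-2"], entries_per_ledger=3)
for i in range(6):
    p.append(f"m{i}")
placement_before = [ledger["node"] for ledger in p.ledgers]
p.add_storage_node("bookie-3")
p.append("m6")
placement_after = [ledger["node"] for ledger in p.ledgers]
```

After the new node joins, the earlier ledgers are still where they were and the next write goes to the new node, so nothing has to be copied before it starts taking load.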

See Jack Vanlightly’s blog for an in-depth explanation of Pulsar’s architecture and storage model.

Multitenancy

Multi-tenant infrastructure can be shared across multiple users and organizations while isolating them from each other. The activities of one tenant should not be able to affect the security or the SLAs of other tenants.

Fundamentally, multitenancy reduces costs in two ways. First, it shares infrastructure that isn’t maxed out by a single tenant, so the cost of each component is amortized across all users. Second, it simplifies administration. When there are dozens, hundreds, or even thousands of tenants, managing a single instance is far simpler than managing one per tenant. Even in a containerized world, “get me an account on this shared system” is much easier to fulfil than “stand me up a new instance of this service.” And global problems may be obscured when they are scattered across many instances.

The Kafka approach

Like geo-replication, multitenancy is hard to graft onto a system that wasn’t designed for it. Kafka is a single-tenant design.

The Pulsar approach

Pulsar builds multitenancy in at the core.

Pulsar enables multitenancy from a single interface.

Pulsar allows us to manage multiple tenants across multiple regions from a single interface that includes authentication and authorization, isolation policy (Pulsar can optionally carve out hardware within the cluster that is dedicated to a single tenant), and storage quotas. Capital One has written a good overview of Pulsar multitenancy.

DataStax’s new Admin Console for Pulsar makes this even easier.
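To make the tenant boundaries concrete, here is a minimal sketch of authorization and storage quotas enforced per tenant on a shared cluster. It is a toy, not Pulsar's admin API: `Tenant`, `SharedCluster`, the role names, and the quota values are hypothetical. The one real detail it borrows is that Pulsar topic names carry the tenant, in the form `persistent://tenant/namespace/topic`.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    allowed_roles: set
    storage_quota_bytes: int
    used_bytes: int = 0


class SharedCluster:
    """Toy model of one cluster shared by several tenants."""

    def __init__(self):
        self.tenants = {}

    def add_tenant(self, tenant):
        self.tenants[tenant.name] = tenant

    def publish(self, role, topic, payload):
        # Topic names look like persistent://tenant/namespace/topic.
        _, _, path = topic.partition("://")
        tenant_name, _namespace, _name = path.split("/", 2)
        tenant = self.tenants[tenant_name]
        if role not in tenant.allowed_roles:
            raise PermissionError(f"{role} may not publish to tenant {tenant_name}")
        if tenant.used_bytes + len(payload) > tenant.storage_quota_bytes:
            raise RuntimeError(f"tenant {tenant_name} is over its storage quota")
        tenant.used_bytes += len(payload)


cluster = SharedCluster()
cluster.add_tenant(Tenant("billing", {"billing-svc"}, storage_quota_bytes=10))
cluster.add_tenant(Tenant("search", {"search-svc"}, storage_quota_bytes=1000))
cluster.publish("billing-svc", "persistent://billing/prod/invoices", b"12345678")
```

A role from one tenant cannot publish into another tenant's topics, and one tenant filling its quota leaves the other tenant's budget untouched.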

Queuing (as well as streaming)

As we’ll see below, streaming and queuing are each important for distinct use cases. Implementing both has typically required separate messaging systems, creating extra operational overhead to deploy and manage them. Pulsar solves this headache by housing streaming and queuing under one roof.

The Kafka approach

Kafka offers a classic pub/sub (publish/subscribe) messaging model: publishers send messages to Kafka, which orders them by partition within a topic and sends a copy to every subscriber (or “consumer”).

Kafka orders messages by partition, within a topic, and sends a copy to every consumer. 

Kafka records which messages a consumer has seen with an offset into the log. This means that messages cannot be acknowledged out-of-order, which in turn means that a subscription cannot be shared across multiple consumers. (Kafka allows mapping multiple partitions to a single consumer in its consumer group design, but not the other way around.)

This is fine for pub/sub use cases, sometimes called streaming. For streaming, it’s important to consume messages in the same order in which they were published.
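The limitation of offset-based tracking is easy to see in a small model. The sketch below is not the Kafka client API; `OffsetCursor` is a hypothetical stand-in for a committed offset on one partition. Because progress is a single number, there is no way to record "m2 is done but m0 and m1 are not."

```python
class OffsetCursor:
    """Toy model of a per-partition committed offset."""

    def __init__(self):
        self.committed = 0  # every message below this offset counts as processed

    def commit(self, offset):
        # Committing offset N means "everything before N is done".
        self.committed = max(self.committed, offset)

    def unprocessed(self, log):
        return log[self.committed:]


log = ["m0", "m1", "m2", "m3"]
cursor = OffsetCursor()

# Suppose m2 finishes first while m0 and m1 are still in flight on other workers.
# The only way to record that m2 is done is to commit offset 3.
cursor.commit(3)

# After a restart, m0 and m1 would be skipped even though they never completed.
skipped = [m for m in ["m0", "m1"] if m not in cursor.unprocessed(log)]
```

This is why work within a partition has to be handled in order by one consumer at a time.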

The Pulsar approach

Pulsar supports the pub/sub model, but it also supports the queuing model, where processing order is not important and we just want to load balance messages in a topic across an arbitrary number of consumers.

Pulsar’s queuing model load balances messages across an arbitrary number of consumers. 

This (and queuing-oriented features like “dead letter queue” and negative acknowledgment with redelivery) means that Pulsar can often replace AMQP and JMS use cases as well as Kafka-style pub/sub, offering a further opportunity for cost reduction to enterprises adopting Pulsar.
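The queuing behavior described above can be sketched as a subscription that tracks each message individually. This is a toy model, not the Pulsar client API: `SharedSubscription`, the worker names, and the `max_redeliveries` parameter are hypothetical. It shows round-robin dispatch, out-of-order acknowledgment, negative acknowledgment with redelivery, and a dead letter list for messages that keep failing.

```python
from collections import deque
from itertools import cycle


class SharedSubscription:
    """Toy model of a queue-style subscription shared by several consumers."""

    def __init__(self, consumers, max_redeliveries=2):
        self.consumers = cycle(consumers)
        self.backlog = deque()   # (message, deliveries so far)
        self.unacked = {}        # message -> deliveries so far
        self.dead_letters = []
        self.max_redeliveries = max_redeliveries

    def publish(self, message):
        self.backlog.append((message, 0))

    def dispatch(self):
        # Hand the next message to the next consumer, round robin.
        message, count = self.backlog.popleft()
        self.unacked[message] = count + 1
        return next(self.consumers), message

    def ack(self, message):
        # Messages are acknowledged one by one, in any order.
        del self.unacked[message]

    def nack(self, message):
        count = self.unacked.pop(message)
        if count > self.max_redeliveries:
            self.dead_letters.append(message)      # give up on this message
        else:
            self.backlog.append((message, count))  # schedule a redelivery


sub = SharedSubscription(["worker-a", "worker-b"], max_redeliveries=1)
for job in ["job-1", "job-2", "job-3"]:
    sub.publish(job)
assignments = [sub.dispatch() for _ in range(3)]
sub.ack("job-3")             # acknowledged before job-1, which is fine here
sub.ack("job-1")
sub.nack("job-2")            # goes back to the backlog
redelivered = sub.dispatch()
sub.nack("job-2")            # fails again and is moved to the dead letters
```

Because each message carries its own acknowledgment state, adding more workers only changes who receives the next message, not how progress is tracked.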

Benefits of Pulsar as a service

After weighing all the advantages Pulsar has over Kafka, you may be wondering about the best way to get started. To get the most out of their deployment, many companies are turning to a managed service option, Pulsar as a service.

Pulsar has the advantage of being open source and highly configurable to best meet the needs of your company. While that’s good news, it can also quickly make things complicated. You start with a blank slate and the opportunity to configure every area of the software, which can be overwhelming. To get the most out of Pulsar, you need specialized experience and expertise.

Pulsar as a service options, like DataStax’s Astra Streaming, provide a turnkey solution and a dedicated, experienced team of messaging experts. Pulling this off in-house would mean hiring and retaining an expensive team and keeping them up-to-date with Pulsar training and certifications. Pulsar as a service removes these burdens from your plate, so you can rest easy knowing your system is being managed and monitored by a trusted partner, and you can focus on what you do best: building applications and delivering value to your customers.

Let’s take a look at some additional benefits of Pulsar as a service.

Visibility

It’s important to use a service that includes real-time monitoring, providing you with a continuous window into the health of your Pulsar system. It should provide performance insights, and alert you to potential risks to your deployment. At DataStax, we use Pulsar Heartbeat to track numerous metrics and ensure things are running at full strength.

Ease of implementation

While Pulsar is a breeze for your end users, implementation and configuration are complex. If it’s not set up correctly, you can easily run into a sea of errors. The top Pulsar as a service providers know how to leverage all those options to best meet your needs and ensure everything runs correctly in your environment.

Pay as you go

With Pulsar as a service, you only pay for what you use. The level of service can scale up or down to best meet your needs.

Automatic upgrades

Using a managed service option also relieves you of the nuisance of dealing with never-ending upgrades and patches. Instead, everything is always kept up-to-date for you, ensuring your system is as stable and secure as possible.

24/7 support

With the best Pulsar as a service options, a team of experts will have your back, solve any issues that pop up, and help optimize performance 24 hours a day, seven days a week, 365 days a year. That would be awfully difficult and expensive to achieve in-house.

Conclusion

Pulsar’s architecture gives it important advantages over Kafka in geo-replication, scaling, multitenancy, and queuing. DataStax is excited to join the Pulsar community with today’s announcement that we have acquired the Kesque Pulsar-as-a-service and are open-sourcing the management and monitoring tools built by the Kesque team in our new Luna Streaming distribution of Pulsar.


Want to try out Apache Pulsar? Sign up now for Astra Streaming, our fully managed Apache Pulsar service. We’ll give you access to its full capabilities entirely free through beta. See for yourself how easy it is to build modern data applications and let us know what you’d like to see to make your experience even better.
