Apache Pulsar Performance Testing with NoSQLBench
There are two traditional ways of doing performance testing with Apache Pulsar. One is with the pulsar-perf command-line tool that’s part of the Pulsar distribution. Another is through the OpenMessaging Benchmark (OMB) tool – a generic testing framework for a variety of messaging platforms including Apache Pulsar and Kafka.
From a workload generation and performance testing perspective, both tools can be used to simulate high-throughput message publishing or consuming against one or multiple Pulsar topics. pulsar-perf can also simulate other types of Pulsar workloads, such as Pulsar reader, WebSocket producer, and managed ledger workloads.
Both tools, however, have limitations when it comes to simulating realistic production workloads.
- Support for message keys is limited. In OMB, there’s no easy way to specify message keys as part of the workload generation. pulsar-perf, on the other hand, only supports generating keys as auto-increment or purely random values
- Both tools can only test Pulsar topics within a specific tenant and namespace
- Neither tool supports message schema. They can only use the default binary format as the message payload type
- Message payload is either randomly generated or statically assigned for each message
- The OMB tool is only available on the AWS public cloud platform, which may not represent the actual Pulsar production infrastructure environment
- Not all tuning parameters are available to finely control the Pulsar client’s execution behavior
So there’s a need for a new performance testing tool suitable for simulating real production workloads. This is why we created the NoSQLBench Pulsar Driver. Here’s an introduction to the NoSQLBench tool and how the NoSQLBench Pulsar Driver simulates a realistic production Pulsar workload.
Introducing NoSQLBench
NoSQLBench is a generic workload generation and performance testing tool for distributed systems. It started as an internal performance testing tool for Apache Cassandra® at DataStax.
It gradually expanded to other distributed systems like Apache Pulsar. NoSQLBench was open-sourced in 2020 under the Apache License 2.0 (APLv2).
Compared with other performance testing tools for distributed systems, NoSQLBench has some unique features and benefits.
- Recipe-oriented procedural data generation through built-in function flows. Data can be both statically and dynamically bound
- Deterministic and repeatable workload behavior
- Modular protocol support via driver extensions for different types of distributed systems
- Configuration (YAML) based testing scenario design and access pattern modeling using the target system’s native approach
NoSQLBench workloads are defined in YAML configuration files. At its core, a NoSQLBench workload is composed of statements. Each statement can have its own parameters. A set of statement blocks can be grouped together under a common name and selected by tags.
Data used in each statement can either be statically or dynamically bound. Each binding represents how one particular piece of data is generated via a series of built-in functions.
The example YAML file below, hello-world.yaml, shows the basic structure of a NoSQLBench workload for Cassandra. It defines two statements grouped under a common phase, main: one for a Cassandra write and one for a Cassandra read.
The Cassandra write/read ratio (4 to 1) is specified by the statement parameter ratio. The data for the Cassandra write and read is dynamically bound according to the binding function flows.
bindings:
  cycle: Identity()
  cyclename: NumberNameToString()
  sample: Normal(100.0D, 10.0D)
  randomish_cycle: HashRangeScaled()
blocks:
  - tags:
      phase: main
    statements:
      - insert-sample: |
          insert into hello.world (cycle,name,sample)
          values ({cycle},{cyclename},{sample});
        ratio: 4
      - read-sample: |
          select * from hello.world where cycle={randomish_cycle}
        ratio: 1
To execute this workload, simply run the following NoSQLBench command to trigger one million Cassandra operations with 800K writes and 200K reads.
nb run driver=cql yaml=hello-world.yaml tags=phase:main host=<Cassandra_host_name> cycles=1M
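As a quick sanity check on those numbers, here is a minimal Python sketch of how a 4:1 ratio divides one million cycles. It assumes operations are assigned to cycles strictly in proportion to their ratio values. The helper name split_cycles is made up for illustration and is not part of NoSQLBench.

```python
def split_cycles(total_cycles, ratios):
    """Split a cycle count across statements in proportion to their ratios.

    Assumption: each statement gets total_cycles * ratio / sum(ratios) cycles.
    """
    ratio_sum = sum(ratios.values())
    return {name: total_cycles * r // ratio_sum for name, r in ratios.items()}


# Ratios taken from the workload above: 4 writes for every 1 read.
counts = split_cycles(1_000_000, {"insert-sample": 4, "read-sample": 1})
print(counts)  # {'insert-sample': 800000, 'read-sample': 200000}
```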
The NoSQLBench tool has more features and capabilities than we can cover here. You can learn more in the NoSQLBench documentation and this NoSQLBench tutorial by Jonathan Shook.
NoSQLBench Pulsar Driver in detail
NoSQLBench provides a common framework for workload generation and performance testing. The actual workload execution against a particular distributed system is achieved via custom-built drivers. This is what NoSQLBench Pulsar Driver is designed to do for performance testing against Apache Pulsar.
At a high level, we can achieve the following testing objectives with the NoSQLBench Pulsar Driver that are hard or impossible to implement with other existing Apache Pulsar performance testing tools.
- Concurrent message publishing and consuming for topics under multiple tenants and namespaces
- Realistic production workload simulation with fine-grained control of message keys, message properties, and message payloads
- Complex Pulsar message schema types like Apache Avro
- End-to-end message processing latency tracking, from message publishing to receipt
- Detection of abnormal message processing errors like loss, duplication, and out-of-order delivery
- Easy integration of rich client-side metrics with Pulsar cluster metrics for a holistic view of system performance behavior
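To make the error detection objective more concrete, here is a rough Python sketch of one common way to spot loss, duplication, and out-of-order delivery. This post doesn't describe how the driver does it internally, so the approach here is an assumption: each published message carries an increasing sequence number, and the consumer side checks the numbers it receives. The function name classify_sequence is hypothetical.

```python
def classify_sequence(received, expected_count):
    """Classify received sequence numbers against the range 0..expected_count-1.

    Returns (lost, duplicated, out_of_order) as sorted lists.
    """
    seen = set()
    duplicated = set()
    out_of_order = set()
    highest = -1
    for seq in received:
        if seq in seen:
            duplicated.add(seq)      # the same number was delivered again
            continue
        if seq < highest:
            out_of_order.add(seq)    # arrived after a larger number
        seen.add(seq)
        highest = max(highest, seq)
    lost = set(range(expected_count)) - seen  # never delivered at all
    return sorted(lost), sorted(duplicated), sorted(out_of_order)


# Made-up delivery order: 2 arrives late and twice, 4 never arrives.
lost, duplicated, out_of_order = classify_sequence([0, 1, 3, 2, 2, 5], 6)
print(lost, duplicated, out_of_order)  # [4] [2] [2]
```

A check like this only works because the workload is deterministic: the consumer knows exactly which sequence numbers to expect.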
Deterministic workload definition
The Pulsar Driver defines Pulsar workloads in YAML files, following the generic NoSQLBench workload definition pattern. At a high level, a Pulsar workload YAML file has three main sections: bindings, params, and blocks.
- Bindings define how message data is actually generated. For this part, the Pulsar driver follows the generic NoSQLBench data binding principle
- Params define the document-level configuration settings that may impact multiple or all Pulsar workload statements. More detail on this in the next section
- Blocks define Pulsar workload details as NoSQLBench statement blocks. Depending on Pulsar workload type, the statement block content may be different
A template YAML file looks like this:
bindings:
  … …
params:
  … …
blocks:
  - name: <statement_block_1>
    tags:
      phase: … …
    statements:
      - name: <statement_name_1>
        … …
      - name: <statement_name_2>
        … …
NoSQLBench Pulsar Driver currently supports the following types of Pulsar workload.
- Create or delete Pulsar tenants, namespaces and topics
- Message reading
- Message publishing
- Message batch publishing (only relevant when using Pulsar sync API)
- Message consuming (also using topic list or pattern)
The definitions of these Pulsar workload types as NoSQLBench statement blocks are described in detail in the NoSQLBench Pulsar Driver documentation.
Complete workload behavior configuration
The behavior of the Pulsar workload execution is impacted by configuration settings that can be set at different levels described in detail below.
- Global
- Document
- Statement
Global Level configuration
The NoSQLBench Pulsar Driver is built upon Pulsar’s Java client API. With this API, various configuration settings can be used to fine-tune the behavior of components such as the client connection and the producer.
The NoSQLBench Pulsar Driver supports all these configuration settings. You can define the desired configuration settings in a Java property file like config.properties with the right prefixes.
For example, for all client connection-related configuration settings, the prefix is “client.” For Pulsar producer-related configuration settings, the prefix is “producer.”, and so on. The schema settings used later in this post, schema.type and schema.definition, follow the same convention.
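To illustrate the prefix convention, the Python sketch below embeds a hypothetical config.properties file and groups its entries by prefix. The key names after each prefix (connectionTimeoutMs, sendTimeoutMs, blockIfQueueFull) and their values are illustrative assumptions rather than a list taken from the driver documentation, and the helper group_by_prefix is made up for this example.

```python
from collections import defaultdict

# Hypothetical config.properties content. The key names after each prefix
# are assumptions for illustration only.
EXAMPLE_PROPERTIES = """
# client connection settings
client.connectionTimeoutMs=5000
# producer settings
producer.sendTimeoutMs=30000
producer.blockIfQueueFull=true
# schema settings
schema.type=avro
"""


def group_by_prefix(text):
    """Group 'prefix.key=value' lines into {prefix: {key: value}}."""
    groups = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition("=")
        prefix, _, key = name.partition(".")
        groups[prefix][key.strip()] = value.strip()
    return dict(groups)


settings = group_by_prefix(EXAMPLE_PROPERTIES)
print(settings["producer"])
```

Grouping by prefix is what lets a single properties file carry settings for several client components without name clashes.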
The settings under global level configuration will impact all NoSQLBench Pulsar workload definition YAML files. They’re provided as a NoSQLBench runtime execution parameter, config, as shown below.
nb run driver=pulsar tags=phase:producer cycles=1M web_url=http://localhost:8080 service_url=pulsar://localhost:6650 config=<dir>/config.properties yaml=<dir>/pulsar_nb.yaml
Document and Statement Level configuration
Another set of configuration settings can only impact one particular Pulsar workload definition YAML file. Depending on the scope of their impacts in the file, these settings are further categorized as Document Level settings and Statement Level settings.
Document level configuration settings are placed under the “params:” section of the NoSQLBench workload definition YAML file. They impact multiple or all Pulsar workload types defined in the YAML file. There are currently four Document Level configuration settings for the NoSQLBench Pulsar Driver.
One Pulsar workload type may include one or more statements grouped under different statement blocks. Each statement may have its unique configuration settings. The settings under one statement have no impact on other statements.
You can find a detailed description of the Document Level and the Statement Level configuration settings in the NoSQLBench Pulsar Driver documentation.
Realistic message simulation
When we use the NoSQLBench Pulsar Driver for Pulsar message processing testing, we get a realistic message workload simulation as close to a production environment as possible. We’ll use a simple oil and gas sensor IoT example to see how it works.
We’re assuming that each message corresponds to a piece of sensor data containing the following information.
- Message Key: The drilling bit identifier
- Message Property: Various static properties of a drilling bit
- Message Payload: Actual data collected by one sensor on a drilling bit
Simulating a message format in such detail isn’t possible with any existing Pulsar performance testing tools. But NoSQLBench Pulsar Driver makes it simple. Below is an example of a NoSQLBench workload definition YAML file.
bindings:
  drill_id: ToUUID(); ToString();
  prop1_val: AlphaNumericString(10);
  prop2_val: AlphaNumericString(20);
  sensor_id: ToUUID(); ToString();
  sensor_type: HashedLineToString('sens_type_values.txt')
  reading_time: ToDateTime();
  reading_value: ToFloat(100);
params:
  … …
blocks:
  - name: producer-block
    tags:
      phase: producer
    statements:
      - name: s1
        optype: msg-send
        msg_key: "{drill_id}"
        msg_property: |
          {
            "drill_prop1": "{prop1_val}",
            "drill_prop2": "{prop2_val}"
          }
        msg_value: |
          {
            "SensorID": "{sensor_id}",
            "SensorType": "{sensor_type}",
            "ReadingTime": "{reading_time}",
            "ReadingValue": {reading_value}
          }
The message key is a UUID string representing a specific drilling bit ID. The message property is defined by a JSON string that contains a list of key/value pairs representing various drilling bit properties. The message payload is also defined by a JSON string that follows an Avro schema definition.
{
  "type": "record",
  "name": "IotSensor",
  "namespace": "TestNS",
  "fields" : [
    {"name": "SensorID", "type": "string"},
    {"name": "SensorType", "type": "string"},
    {"name": "ReadingTime", "type": "string"},
    {"name": "ReadingValue", "type": "float"}
  ]
}
The Avro schema definition is made available to the Pulsar workload by the following global level configuration settings.
schema.type=avro
schema.definition=file:///path/to/iot-example.avsc
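Because the payload template and the Avro schema are maintained separately, it can be worth checking that they agree before a long test run. The Python sketch below is a simplified structural check, not full Avro validation: it only compares field names and basic JSON types against the four fields in the schema above. The sample payload values and the helper name check_payload are made up for illustration.

```python
import json

# Field list copied from the Avro schema above. The type map is a
# simplification that only covers the two types used there.
SCHEMA_FIELDS = {
    "SensorID": "string",
    "SensorType": "string",
    "ReadingTime": "string",
    "ReadingValue": "float",
}
PYTHON_TYPES = {"string": str, "float": (int, float)}


def check_payload(payload_json):
    """Return a list of problems; an empty list means the shape matches."""
    record = json.loads(payload_json)
    problems = []
    for name, avro_type in SCHEMA_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], PYTHON_TYPES[avro_type]):
            problems.append(f"wrong type for {name}")
    problems += [f"unexpected field: {n}" for n in record if n not in SCHEMA_FIELDS]
    return problems


# Sample values are invented; a real run would use the generated bindings.
sample = json.dumps({
    "SensorID": "abc",
    "SensorType": "Temperature",
    "ReadingTime": "2021-01-01T00:00:00",
    "ReadingValue": 42.5,
})
print(check_payload(sample))  # []
```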
Authentic multi-tenancy message processing testing
Real multi-tenancy message processing testing is another big advantage of using the NoSQLBench Pulsar Driver over other Pulsar performance testing tools. We get this by dynamically binding data to Pulsar topic names. For a bit more context:
- Static binding means the Pulsar topic name will be the same for all NoSQLBench execution cycles. Each cycle represents one particular operation like publishing a message. This is the single-topic testing we know from other Pulsar testing tools
- Dynamic binding means the Pulsar topic name can differ from one NoSQLBench execution cycle to the next. The actual Pulsar topic name per cycle will follow the NoSQLBench data binding principle
In the next example, we create 100 Pulsar topics under 10 tenants with two namespaces per tenant and five topics per namespace. The Pulsar topic names that are generated follow the pattern of persistent://tnt[0~9]/ns[0~1]/t[0~4].
bindings:
  tenant: Mod(100); Div(10L); ToString(); Prefix("tnt")
  namespace: Mod(10); Div(5L); ToString(); Prefix("ns")
  core_topic_name: Mod(5); ToString(); Prefix("t")
params:
  topic_uri: "persistent://{tenant}/{namespace}/{core_topic_name}"
When running a NoSQLBench Pulsar message publishing workload with one million execution cycles (publishing one million messages), NoSQLBench will make sure that the one million messages are evenly distributed among 100 topics under 20 namespaces of 10 tenants.
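The arithmetic behind this distribution can be reproduced with a short Python sketch. It assumes the binding functions receive the raw cycle number and that Mod and Div behave as plain integer modulo and integer division applied left to right. The function topic_for_cycle is a stand-in written for this example, not NoSQLBench code.

```python
from collections import Counter


def topic_for_cycle(cycle):
    """Mimic the three bindings with plain integer arithmetic.

    Assumption: Mod(n) is cycle % n and Div(n) is integer division,
    applied left to right to the raw cycle number.
    """
    tenant = f"tnt{(cycle % 100) // 10}"      # Mod(100); Div(10L)
    namespace = f"ns{(cycle % 10) // 5}"      # Mod(10); Div(5L)
    core_topic_name = f"t{cycle % 5}"         # Mod(5)
    return f"persistent://{tenant}/{namespace}/{core_topic_name}"


# Count how many of one million cycles land on each topic.
counts = Counter(topic_for_cycle(c) for c in range(1_000_000))
print(len(counts), min(counts.values()), max(counts.values()))  # 100 10000 10000
```

Under these assumptions, every block of 100 consecutive cycles touches each of the 100 topics exactly once, which is why the messages end up evenly spread.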
Conclusion
We’ve explored a powerful, open-source performance testing utility for Apache Pulsar: the NoSQLBench Pulsar Driver. It has distinct advantages compared with other Pulsar performance testing utilities. It enables deterministic, realistic, and production-like Pulsar workload simulation. Execution is easy and flexible.
Another distinguishing feature is the ability to detect abnormal Pulsar message processing errors like message loss, message duplication, and out-of-order delivery. The deterministic nature of this utility’s workload generation and execution makes it easy to capture, analyze, and debug these errors. This feature has helped our Pulsar engineers uncover a few Apache Pulsar bugs in some rare, race-condition-related situations. It has other useful features that we haven’t covered here.
We made it open-source – and wrote this post – in the hopes that the Apache Pulsar community will benefit from the powers of NoSQLBench Pulsar Driver. We look forward to seeing how it helps your performance testing, messaging model design, sizing analysis, and cluster deployment evaluation for applications and systems based on Apache Pulsar.