Running multiple DataStax Enterprise nodes in a single host
This article is about setting up a DataStax Enterprise cluster running in a single host.
There are a variety of reasons why you might want to run a DataStax Enterprise cluster inside a single host. For instance, your server vendor talked you into buying this vertical-scale machine but Apache Cassandra™ can't effectively use all the resources available. Or your developers need to test your app as they develop it, and they'd rather test it locally.
Whatever the reason, this post will show you how to build such a cluster from the ground up.
Multi-JVM Multi-Node HOWTO
The goal is to have a dense node: a single box running multiple DataStax Enterprise database-only nodes in a single cluster.
The DataStax Enterprise cluster that we build in this blog post will consist of:
- 3 DataStax Enterprise nodes running an Apache Cassandra™ workload.
- A simple configuration without internode encryption
- Multiple interfaces (all virtual in this example). Each node will bind its services to its own IP address.
- Shared disks: all nodes will write their data and logs to the same disk. However, since data (or log) directories can be any mount points, you can configure the nodes to point to different physical disks, for instance to improve performance.
The resulting configuration will look like this:
- Single binary tarball installation: we'll install DataStax Enterprise once, and share it across nodes.
- Multiple node locations: each node will have its own directory hierarchy with configuration files, data, and logs.
Installing the binaries
Register on DataStax Academy. Use your download credentials to download DataStax Enterprise into a directory of your choice:
$ wget --user $USERNAME --password $PASSWORD http://downloads.datastax.com/enterprise/dse.tar.gz
After the download completes, unpack it:
$ tar zxf dse.tar.gz
Unpacking the tarball creates a dse-4.8.0/ directory, which will be our DSE_HOME for this tutorial:
$ export DSE_HOME=`pwd`/dse-4.8.0/
Setting up the nodes
We'll first create a root directory per node. Each of these directories will hold a node's configuration files, data, and logs. Let's also create the data/ and logs/ directories while we're at it:
$ for i in 1 2 3; do mkdir -p node$i/data node$i/logs; done
Next, we'll copy all the configuration files. For each service, first create the corresponding directory in the node's configuration directory, and then copy the files. For example, for Apache Cassandra™ workloads:
$ mkdir -p node1/resources/cassandra && cp -r $DSE_HOME/resources/cassandra/conf node1/resources/cassandra
Iterating over every service in the resources directory can be done with a for loop:
$ for service in `ls dse-4.8.0/resources | grep -v driver | grep -v log4j`; do mkdir -p node1/resources/$service && cp -r $DSE_HOME/resources/$service/conf node1/resources/$service; done
Now we repeat this step for every node in our cluster:
$ for node in node1 node2 node3; do for service in `ls dse-4.8.0/resources | grep -v driver | grep -v log4j`; do mkdir -p $node/resources/$service && cp -r $DSE_HOME/resources/$service/conf $node/resources/$service; done; done
After the files are in place, we can make the required changes to the resources/cassandra/conf/cassandra.yaml and resources/cassandra/conf/cassandra-env.sh files on each node to create a working DataStax Enterprise cluster. In these files, we configure the cluster name, the interface the node will bind to, the directories where the node will store its data and logs, and more.
Editing the cluster configuration
To configure parameters like the cluster name and the data and log directories, edit the cassandra.yaml file in nodeN/resources/cassandra/conf. Below is a list of the minimum parameters (and their locations) we'll have to set for each node to get a functional DataStax Enterprise cluster.
Fire up your favourite text editor (by which I mean "fire up emacs"), and let's do it.
cassandra.yaml
- cluster_name: change the cluster name so that the nodes are all part of the same cluster, for example, cluster_name: 'clusty'
- commitlog_directory, data_file_directories, and saved_caches_directory: specify where the node will keep its data, its commit log, and saved caches, for example, commitlog_directory: node1/data/commitlog
- listen_address: The IP address or hostname that Cassandra binds to for connecting to other nodes. Alternatively we could change listen_interface. For example listen_address: 127.0.0.1 for node1, listen_address: 127.0.0.2 for node2, and so on.
- rpc_address: The listen address for client connections (Thrift RPC service and native transport).
- seeds: the comma-separated list of IP addresses of the seed nodes goes here (see the example snippet below)
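Putting these together, here's a minimal sketch of what the relevant part of node1's cassandra.yaml could look like (the directory paths are the ones used in this tutorial and are otherwise arbitrary; all other settings keep their defaults):
cluster_name: 'clusty'
commitlog_directory: node1/data/commitlog
data_file_directories:
    - node1/data/data
saved_caches_directory: node1/data/saved_caches
listen_address: 127.0.0.1
rpc_address: 127.0.0.1
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "127.0.0.1"
For node2 and node3, change the addresses to 127.0.0.2 and 127.0.0.3 and the directories to node2/... and node3/... respectively.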
cassandra-env.sh
The only parameter to change here is the port that JMX binds to. For security reasons (see the vulnerability and the Apache Cassandra™ fix) JMX will only bind to localhost, so we'll need a separate port per node.
Change the line JMX_PORT="7199" to list a different port for every node, e.g. 7199 for node1, 7299 for node2, and so on.
Note: If you really want to bind JMX to an address other than localhost, you can use Al Tobey's JMXIPBind. Just follow the instructions there.
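One way to apply the port change to all nodes at once is a quick sed loop (a sketch, assuming each node's cassandra-env.sh still contains the default JMX_PORT="7199" line):
$ for i in 1 2 3; do sed -i "s/JMX_PORT=\"7199\"/JMX_PORT=\"7${i}99\"/" node$i/resources/cassandra/conf/cassandra-env.sh; done
This leaves node1 on 7199 and moves node2 and node3 to 7299 and 7399.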
logback.xml
The last bit that needs tweaking is the location of the nodes' log directory in resources/cassandra/conf/logback.xml. We'll have to define a property named cassandra.logdir to point to the right location for each node, e.g.
<property name="cassandra.logdir" value="node1/logs/" />
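If you'd rather not edit each file by hand, something like the following can add that property to every node's logback.xml (a sketch, assuming GNU sed and that the opening <configuration> tag sits alone on a single line):
$ for i in 1 2 3; do sed -i "/<configuration/a <property name=\"cassandra.logdir\" value=\"node$i/logs/\" />" node$i/resources/cassandra/conf/logback.xml; done
Double-check the result: if a cassandra.logdir property is already defined elsewhere in the file, adjust that one instead.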
Environment variables
After editing the configuration files, we're ready to start our nodes.
So that DSE picks up the right configuration files, we'll have to specify their locations via environment variables.
The first variable to be set is DSE_HOME. In the previous section we saw how to do it, but let's refresh it here:
$ export DSE_HOME=`pwd`/dse-4.8.0
Since we're configuring a homogeneous cluster of nodes running only Apache Cassandra™ workloads, we only need to set the configuration environment variables relevant to this workload, pointing them at each node's configuration directory. For NODE=node1:
$ export DSE_CONF=$NODE/resources/dse/conf
$ export CASSANDRA_HOME=$NODE/resources/cassandra
$ export CASSANDRA_CONF=$CASSANDRA_HOME/conf
The remaining environment variables can be set to the default configuration files:
$ export TOMCAT_HOME=$DSE_HOME/resources/tomcat
$ export TOMCAT_CONF_DIR=$TOMCAT_HOME/conf
$ export HADOOP_CONF_DIR=$DSE_HOME/resources/hadoop/conf
$ export HADOOP_HOME=$DSE_HOME/resources/hadoop
$ export HIVE_CONF_DIR=$DSE_HOME/resources/hive/conf
$ export SPARK_CONF_DIR=$DSE_HOME/resources/spark/conf
After setting these environment variables, we can start our node:
$ $DSE_HOME/bin/dse cassandra -f
To stop the node, press Control+C.
To start all 3 nodes, we could run the start command without the -f flag so that each process runs in the background, then point the environment variables at the next node (e.g. set NODE=node2 and re-export them) and run the command again. But that's not very practical. Fortunately, this process can be automated with scripts, so that we can start a node with a command like:
$ with-dse-env.sh node1 bin/dse cassandra -f
Example scripts
In the previous sections we've outlined the steps necessary to configure a cluster of nodes in a single host. The following example scripts automate the steps that are outlined above.
dense-install.sh
This script copies the relevant configuration files for each node, and edits them according to the description outlined in the previous sections.
In a directory that contains the DataStax Enterprise installation tarball (dse.tar.gz), use it like:
$ path/to/dense-install.sh clusty 3
to create a cluster named clusty that consists of 3 DataStax Enterprise nodes. Keep in mind that the configuration done by the script is minimal (though it'll give you a working cluster). If you want to change anything else, such as enabling encryption or modifying token ranges, make those changes before starting the nodes.
You can download the dense server installation script here.
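To give an idea of what the script does under the hood, here's a condensed sketch of its main loop (illustrative only; the downloadable script does more, and the sed patterns assume the default cassandra.yaml and cassandra-env.sh layout):
#!/bin/bash
# dense-install.sh <cluster_name> <num_nodes> -- condensed sketch
set -e
CLUSTER=$1
NODES=$2

tar zxf dse.tar.gz
DSE_DIR=$(ls -d dse-*/ | head -1)

for i in $(seq 1 "$NODES"); do
  NODE=node$i
  mkdir -p $NODE/data $NODE/logs
  # Copy each service's conf directory into the node's own tree
  for service in $(ls $DSE_DIR/resources | grep -v driver | grep -v log4j); do
    mkdir -p $NODE/resources/$service
    cp -r $DSE_DIR/resources/$service/conf $NODE/resources/$service
  done
  YAML=$NODE/resources/cassandra/conf/cassandra.yaml
  # Cluster name, directories, bind addresses (127.0.0.$i), and seeds (node1)
  sed -i "s/^cluster_name:.*/cluster_name: '$CLUSTER'/" $YAML
  sed -i "s|^commitlog_directory:.*|commitlog_directory: $NODE/data/commitlog|" $YAML
  sed -i "s|^saved_caches_directory:.*|saved_caches_directory: $NODE/data/saved_caches|" $YAML
  sed -i "s/^listen_address:.*/listen_address: 127.0.0.$i/" $YAML
  sed -i "s/^rpc_address:.*/rpc_address: 127.0.0.$i/" $YAML
  sed -i 's/- seeds:.*/- seeds: "127.0.0.1"/' $YAML
  # data_file_directories (a YAML list) and logback.xml's cassandra.logdir
  # need equivalent per-node edits (omitted here for brevity)
  # JMX ports: 7199, 7299, 7399, ...
  sed -i "s/JMX_PORT=\"7199\"/JMX_PORT=\"7${i}99\"/" $NODE/resources/cassandra/conf/cassandra-env.sh
done
echo Done.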
with-dse-env.sh
This script will set the relevant environment variable values and execute the command requested, for instance to start DataStax Enterprise with an Apache Cassandra™ workload for node1 do:
$ path/to/with-dse-env.sh node1 bin/dse cassandra -f
The script assumes that the current directory (as reported by `pwd`) contains the nodes' configuration files as updated by the dense-install.sh script.
You can download the script that sets up the right environment here.
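For reference, here's a minimal sketch of what such a wrapper looks like (the downloadable script may differ in its details):
#!/bin/bash
# with-dse-env.sh <nodeN> <command> [args...] -- minimal sketch
# Sets per-node environment variables, then runs the given DSE command.
NODE=$1; shift

export DSE_HOME=$(pwd)/dse-4.8.0
export DSE_CONF=$NODE/resources/dse/conf
export CASSANDRA_HOME=$NODE/resources/cassandra
export CASSANDRA_CONF=$CASSANDRA_HOME/conf
# Tomcat is node-specific so DSE Search workloads pick up per-node settings too
export TOMCAT_HOME=$NODE/resources/tomcat
export TOMCAT_CONF_DIR=$TOMCAT_HOME/conf
# The remaining services use the shared defaults
export HADOOP_HOME=$DSE_HOME/resources/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export HIVE_CONF_DIR=$DSE_HOME/resources/hive/conf
export SPARK_CONF_DIR=$DSE_HOME/resources/spark/conf

CMD=$1; shift
exec "$DSE_HOME/$CMD" "$@"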
A DataStax Enterprise cluster
Now that you have downloaded the scripts, let's use them to create and start a DataStax Enterprise cluster running Apache Cassandra™ workloads.
Configure the network interfaces
Before anything else, we must ensure our host has a network interface available for each of the nodes in the cluster. In this tutorial we will use virtual network interfaces.
To create the appropriate virtual network interfaces in Linux, use ifconfig. For example:
$ ifconfig lo:0 127.0.0.2
You must repeat this step for every node in the cluster. If there are 3 nodes, the first node uses 127.0.0.1, the second node uses 127.0.0.2 (virtual), and the third node uses 127.0.0.3 (virtual as well).
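For this tutorial's 3-node cluster, that boils down to the following (run as root or with sudo; note that aliases created this way don't survive a reboot):
$ sudo ifconfig lo:0 127.0.0.2 up
$ sudo ifconfig lo:1 127.0.0.3 up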
Create the cluster
Creating a new cluster, regardless of the type of workload, is done with the script dense-install.sh:
$ dense-install.sh clusty 3
~= cluster: clusty, 3 nodes =~
* Unpacking dse.tar.gz...
Will install from dse-4.8.0
* Setting up C* nodes...
+ Setting up node 1...
  - Copying configs
  - Setting up the cluster name
  - Setting up JMX port
  - Setting up directories
  - Binding services
+ Setting up node 2...
  - Copying configs
  - Setting up the cluster name
  - Setting up JMX port
  - Setting up directories
  - Binding services
+ Setting up node 3...
  - Copying configs
  - Setting up the cluster name
  - Setting up JMX port
  - Setting up directories
  - Binding services
Done.
Starting the cluster
To start each node, we'll run this script. For instance for node1:
$ with-dse-env.sh node1 bin/dse cassandra -f
We use -f so we can quickly see what's going on with each node. Next, we run the same command (in different terminal windows) for the remaining nodes (node2 and node3).
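For example (same command, different node, one terminal each):
$ with-dse-env.sh node2 bin/dse cassandra -f
$ with-dse-env.sh node3 bin/dse cassandra -f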
Now, verify that each node is up and running. For example, for node1:
$ ./with-dse-env.sh node1 bin/dsetool ring
Address    DC         Rack   Workload   Status  State   Load       Owns    Token
127.0.0.1  Cassandra  rack1  Cassandra  Up      Normal  156.77 KB  75.15%  -3479529816454052534
127.0.0.2  Cassandra  rack1  Cassandra  Up      Normal  139.42 KB  19.71%  156529866135460310
127.0.0.3  Cassandra  rack1  Cassandra  Up      Normal  74.07 KB   5.14%   1105391149018994887
Other workloads
In the previous section, we walked through the steps necessary to create a cluster of DataStax Enterprise nodes running just an Apache Cassandra™ workload. Now we'll create a cluster of DSE Search nodes using the same example scripts.
We'll re-use the cluster that we created in the previous section and make only a few DSE Search-specific tweaks.
Search related configuration changes
The nodes of the cluster we configured in the previous section can be started immediately with an Apache Cassandra™ workload. To run them as search nodes instead, we need a few small tweaks first (these are not done in the example scripts because they are specific to DSE Search, and we wanted to keep the scripts as generic as possible).
server.xml
In DataStax Enterprise versions earlier than 4.8, DSE Search binds its services to 0.0.0.0 unless we configure a connector with a different IP address (from 4.8 onwards, DSE Search binds to the same IP address that Apache Cassandra™ does). If you're running a version earlier than 4.8, add the following to the <Service name="Solr"> section of the resources/tomcat/conf/server.xml file (for node1):
<Connector port="${http.port}" protocol="HTTP/1.1" address="127.0.0.1" connectionTimeout="20000" redirectPort="8443" />
For the rest of the nodes, you'll need to change the IP address accordingly, that is 127.0.0.2 for node2 and 127.0.0.3 for node3.
Environment variables
In the previous sections, we set node-specific environment variables only for DataStax Enterprise Apache Cassandra™ workloads. For a DSE Search-specific environment, set these additional variables:
$ export TOMCAT_HOME=$NODE/resources/tomcat
$ export TOMCAT_CONF_DIR=$TOMCAT_HOME/conf
These variables are set in the example scripts, so you don't have to set the variables manually here.
Starting the cluster
To start the nodes with their workload set to search, we need to add the -s flag. For example, for node1:
$ with-dse-env.sh node1 bin/dse cassandra -s -f
After starting all nodes, we can check that they are running and that their workload is indeed that of search nodes. For example, for node1:
$ with-dse-env.sh node1 bin/dsetool ring
Address    DC    Rack   Workload  Status  State   Load       Owns    Token
127.0.0.1  Solr  rack1  Search    Up      Normal  119.29 KB  75.15%  -3479529816454052534
127.0.0.2  Solr  rack1  Search    Up      Normal  152.1 KB   19.71%  156529866135460310
127.0.0.3  Solr  rack1  Search    Up      Normal  61.39 KB   5.14%   1105391149018994887
Notes and Caveats (sort of a conclusion)
In the sections above, we've outlined how to set up a cluster of DataStax Enterprise nodes running an Apache Cassandra™ workload (and a DSE Search workload), all on the same host. We've kept the setup simple to keep the tutorial brief, and provided several helper scripts to help you get started with dense-node installations in a development environment.
However, before you rush and put this in production, there are several points you should consider:
- Network interfaces: in this tutorial all nodes are bound to the same network interface. In production, however, this configuration provides poor performance; DataStax recommends one network adapter per node.
- Disks: just like with network adapters, in this setup the nodes store their data on the same physical disk. To minimize contention, configure each node's directories to live on different disks, for example separate disks or partitions for the commit log, the data, the logs, and so on.
- Replica placement: in terms of fault tolerance, having all replicas of a shard on the same physical host is not a great idea. To have replicas reside on different physical hosts, configure the PropertyFileSnitch so that all shards (taking into account the replication factor) have copies on different machines (see the topology example after this list):
- distribute your cluster across physical machines, e.g. host1 runs nodes a and b, host2 runs nodes c and d
- configure each node to use the PropertyFileSnitch
- place nodes in host1 as being in rack1, nodes in host2 as being in rack2
- cassandra -stop will stop all nodes on the host; consider using the -p pid option to stop a specific node (this is left as an exercise to the reader)
- numactl: use numactl --cpunodebind to split multi-socket machines down the middle. In our experience, this configuration provides a significant performance boost compared to interleaved memory and, as a bonus, much better isolation, since the JVMs will never run on the same cores, avoiding all manner of performance-degrading behavior. Note that bin/cassandra hard-codes numactl --interleave when the numactl binary is available, so you'll need to modify it to override that.
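To make the replica placement point concrete, here's what a cassandra-topology.properties for the two-host example above might look like (the IP addresses are hypothetical):
# host1 runs nodes a and b
10.0.0.1=DC1:rack1
10.0.0.2=DC1:rack1
# host2 runs nodes c and d
10.0.0.3=DC1:rack2
10.0.0.4=DC1:rack2
# unknown nodes
default=DC1:rack1
With a rack-aware replication strategy such as NetworkTopologyStrategy, replicas will then be spread across the two physical hosts.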