Multi-datacenter Replication in Cassandra
DataStax Enterprise makes heavy use of Cassandra's innate datacenter concepts, which allow multiple workloads to be run across multiple datacenters against near real-time data, without ETL processes or any other manual operations. In most setups, this is handled via "virtual" datacenters that follow Cassandra's internal notion of a datacenter, even though the hardware may live in the same physical datacenter. Over the course of this blog post, we will cover this and a couple of other use cases for multiple datacenters.
Workload Separation
By using datacenters as the divisions between varying workloads, DataStax Enterprise allows a natural distribution of data from real-time Cassandra datacenters to near real-time Analytics and Search (Solr) datacenters.
Whenever a write comes in via a client application, it hits the main Cassandra datacenter and returns the acknowledgment at the current consistency level (typically less than LOCAL_QUORUM, to allow for a high throughput and low latency). In parallel and asynchronously, these writes are sent off to the Analytics and Solr datacenters based on the replication strategy for active keyspaces. A typical replication strategy would look similar to {Cassandra: 3, Analytics: 2, Solr: 1}, depending on use cases and throughput requirements.
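As a rough sketch, a keyspace using this kind of replication strategy might be created through the DataStax Python driver as follows; the keyspace name and contact point are placeholders, and the datacenter names ('Cassandra', 'Analytics', 'Solr') are assumed to match what your snitch reports for each workload:

```python
# Hypothetical sketch: create a keyspace replicated across the Cassandra,
# Analytics, and Solr datacenters. Names and addresses are placeholders.
from cassandra.cluster import Cluster

session = Cluster(['10.0.0.1']).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_keyspace
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'Cassandra': 3,
        'Analytics': 2,
        'Solr': 1
    }
""")
```

The datacenter names in the replication map must match the names your snitch assigns to each group of nodes.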
Once these asynchronously replicated writes are received in the other datacenters, they undergo the normal write path and are assimilated into that datacenter. This way, any analytics jobs that are running can access the new data without an ETL process. On DSE's Solr nodes, these writes are introduced into the memtables and additional Solr processes are triggered to incorporate the data.
Live Backup Scenario
Some Cassandra use cases instead use different datacenters as a live backup that can quickly serve as a fallback cluster. We will cover the most common use case, using Amazon Web Services' Elastic Compute Cloud (EC2), in the following example.
Specific Use Case
Allow your application to have multiple fallback patterns across multiple consistencies and datacenters.
Example
A client application was created and currently sends requests to EC2's US-East-1 region at a consistency level (CL) of LOCAL_QUORUM. Later, another datacenter is added in EC2's US-West-1 region to serve as a live backup. The replication strategy can be a full live backup ({US-East-1: 3, US-West-1: 3}) or a smaller live backup ({US-East-1: 3, US-West-1: 2}) to save costs and disk usage in this regional outage scenario. All clients continue to write to the US-East-1 nodes by ensuring that the client's connection pools are restricted to just those nodes, minimizing cross-datacenter latency.
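A minimal sketch of this setup with the DataStax Python driver (newer driver versions configure load balancing through execution profiles instead); the datacenter names 'us-east' and 'us-west' are assumptions that depend on your snitch, and the contact points, keyspace, and table are placeholders:

```python
# Hypothetical sketch: a fully replicated live backup keyspace, plus a client
# connection pool pinned to the US-East-1 datacenter.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.query import SimpleStatement

setup = Cluster(['10.0.0.1']).connect()
setup.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_keyspace
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east': 3,
        'us-west': 3
    }
""")

# Restrict the client to its local datacenter so that normal requests never
# pay cross-datacenter latency.
east_cluster = Cluster(
    contact_points=['10.0.0.1', '10.0.0.2'],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='us-east'))
east_session = east_cluster.connect('my_keyspace')

insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
east_session.execute(insert, (42, 'alice'))
```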
To implement a better cascading fallback, the client's connection pool is initially aware of only the nodes in the US-East-1 region. In the event of client errors, all requests are retried at a CL of LOCAL_QUORUM up to X times, then drop to a CL of ONE while escalating the appropriate notifications. If the requests are still unsuccessful, the client should switch to a new connection pool consisting of nodes from the US-West-1 datacenter and begin contacting US-West-1 at a higher CL, before ultimately dropping down to a CL of ONE. Meanwhile, any writes to US-West-1 should be asynchronously retried on US-East-1 by the client, without waiting for confirmation and instead logging any errors separately.
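A sketch of that cascading fallback, again assuming the DataStax Python driver; the addresses, datacenter names, retry counts, and error handling are placeholders to be adapted to your application:

```python
# Hypothetical sketch of the cascading fallback: exhaust local options first
# (retries, then a lower CL), then fail over to the US-West-1 pool.
import logging

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.query import SimpleStatement

log = logging.getLogger(__name__)

def connect(contact_points, local_dc):
    cluster = Cluster(
        contact_points=contact_points,
        load_balancing_policy=DCAwareRoundRobinPolicy(local_dc=local_dc))
    return cluster.connect('my_keyspace')

east = connect(['10.0.0.1', '10.0.0.2'], 'us-east')
west = connect(['10.1.0.1', '10.1.0.2'], 'us-west')

# (session, consistency level, attempts), tried in order.
FALLBACK_PLAN = [
    (east, ConsistencyLevel.LOCAL_QUORUM, 3),   # retry locally X times
    (east, ConsistencyLevel.ONE, 1),            # then lower the CL locally
    (west, ConsistencyLevel.LOCAL_QUORUM, 1),   # then fail over to US-West-1
    (west, ConsistencyLevel.ONE, 1),
]

def write_with_fallback(cql, params):
    for session, cl, attempts in FALLBACK_PLAN:
        statement = SimpleStatement(cql, consistency_level=cl)
        for _ in range(attempts):
            try:
                result = session.execute(statement, params)
                if session is west:
                    # Mirror the write back to US-East-1 asynchronously;
                    # log failures instead of waiting for confirmation.
                    future = east.execute_async(statement, params)
                    future.add_errback(
                        lambda exc: log.error("mirror to us-east failed: %s", exc))
                return result
            except Exception as exc:  # narrow to driver exceptions in real code
                log.warning("write failed at CL %s: %s", cl, exc)  # escalate notifications here
    raise RuntimeError("write failed in both datacenters")
```

The key design point is that the client exhausts its local options before paying cross-datacenter latency to US-West-1.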
With a fallback pattern like this in place, natural events and other regional failures can be kept from affecting your live applications.
Recovery
As long as the original datacenter is restored within gc_grace_seconds (10 days by default), perform a rolling repair (without the -pr option) on all of its nodes once they come back online.
If, however, the nodes cannot be brought back up and repaired until after gc_grace_seconds has elapsed, you will need to take the following steps to ensure that deleted records are not reinstated:
- remove all the offending nodes from the ring using `nodetool removetoken`,
- clear all their data (data directories, commit logs, snapshots, and system logs),
- decrement all tokens in this region,
- disable auto_bootstrap,
- start up the entire region,
- and, finally, run a rolling repair (without the -pr option) on all nodes in the other region.
After these nodes are up to date, you can restart your applications and continue using your primary datacenter.
Geographical Location Scenario
There are certain use cases where data should be housed in different datacenters depending on the user's location, in order to provide more responsive exchanges. The use case we will cover refers to datacenters in different countries, but the same logic and procedures apply to datacenters in different regions. The logic that defines which datacenter a user will connect to resides in the application code.
Specific Use Case
Have users connect to datacenters based on geographic location, but ensure this data is available cluster-wide for backup, analytics, and to account for user travel across regions.
Example
The required end result is for users in the US to contact one datacenter while UK users contact another to lower end-user latency. An additional requirement is for both of these datacenters to be a part of the same cluster to lower operational costs. This can be handled using the following rules:
- When reading and writing from each datacenter, ensure that the clients the users connect to can only see one datacenter, based on the list of IPs provided to the client (see the sketch after this list).
- If doing reads at QUORUM, ensure that LOCAL_QUORUM is being used and not EACH_QUORUM since this latency will affect the end user's performance experience.
- Depending on how consistent you want your datacenters to be, you may choose to run repair operations (without the -pr option) more frequently than the required once per gc_grace_seconds.
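As referenced in the first rule above, here is a sketch of how a US application instance could be limited to its local datacenter's nodes while reading at LOCAL_QUORUM, assuming the DataStax Python driver; the IPs, keyspace, and table are placeholders:

```python
# Hypothetical sketch: a US application instance that can only see the US
# datacenter's nodes and reads at LOCAL_QUORUM.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy
from cassandra.query import SimpleStatement

US_NODES = ['10.0.0.1', '10.0.0.2', '10.0.0.3']   # only the local datacenter's IPs

us_cluster = Cluster(
    contact_points=US_NODES,
    load_balancing_policy=WhiteListRoundRobinPolicy(US_NODES))
us_session = us_cluster.connect('my_keyspace')

query = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
row = us_session.execute(query, (42,)).one()
```

The UK application instances would be configured the same way, with their own local list of IPs.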
Benefits
When setting up clusters in this manner:
- You ensure faster performance for each end user.
- You gain multi-region live backups for free, just as mentioned in the section above. Granted, requests served across the US and UK will not be as fast, but your application does not have to come to a complete standstill in the event of catastrophic losses.
- Users can travel across regions, and in the time it takes them to travel, their information should have finished replicating asynchronously across regions.
Caveats
- Always use the NetworkTopologyStrategy when using multiple datacenters. SimpleStrategy will work across datacenters, but has the following disadvantages:
- It is harder to maintain.
- It does not provide any of the features mentioned above.
- It introduces latency on each write (depending on the distance and latency between the datacenters).
- Defining one rack for the entire cluster is the simplest and most common implementation. Multiple racks should be avoided for the following reasons:
- Most users tend to ignore or forget the requirement that racks be defined in an alternating order, which is needed for data to be distributed safely and appropriately.
- Many users do not use rack information effectively, for example by defining as many racks as they have nodes, or similar non-beneficial setups.
- When using racks correctly, each rack should typically have the same number of nodes.
- In a scenario that requires a cluster expansion while using racks, the expansion procedure can be tedious since it typically involves several node moves and must ensure that racks continue to distribute data correctly and evenly. At times when clusters need immediate expansion, racks should be the last thing to worry about.