Dynamic snitching in Cassandra: past, present, and future
Dynamic snitching is something we've done in Cassandra as far back as 0.6.5, but can still be a source of confusion for many. In this post, I'm going to do my best to demystify everything and anything you could want to know about it, as well as preview some of the improvements that have been made for our upcoming 1.2 release later this year.
To begin, let's first answer the most obvious question: what is dynamic snitching? To understand this, we'll first recall what a snitch does. A snitch's function is to determine which datacenters and racks are both written to and read from. So, why would that be 'dynamic?' This comes into play on the read side only (there's nothing to be done for writes since we send them all and then block to until the consistency level is achieved.) When doing reads however, Cassandra only asks one node for the actual data, and, depending on consistency level and read repair chance, it asks the remaining replicas for checksums only. This means that it has a choice of however many replicas exist to ask for the actual data, and this is where the dynamic snitch goes to work.
Since only one replica is sending the full data we need, we need to chose the best possible replica to ask, since if all we get back is checksums we have nothing useful to return to the user. The dynamic snitch handles this task by monitoring the performance of reads from the various replicas and choosing the best one based on this history.
In a modern Cassandra world however, read repair chance is lowered, since hints were made reliable. So, given this, can't we maximize our cache capacity a bit when our consistency level is ONE? Yes! This is exactly what the badness threshold parameter to the dynamic snitch is designed to do. What this parameter defines is, as a percentage, how much worse the natural first replicamust perform, in order to switch to a different one. Thus given replicas X, Y, and Z, the X replica will be preferred until it performs badness_threshold worse than Y or Z. This means that when everything is healthy, the cache capacity is maximized across the nodes, but if things gets worse, specifically badness_threshold worse, then Cassandra will continue to provide availability by using the other replicas at its disposal. The dynamic snitch doesn't determine replica placement itself, though, that is what your chosen snitch does. The dynamic snitch simply wraps your snitch, and provides this adaptive behavior on reads. How is this accomplished? Originally, the dynamic snitch was modeled in a similar fashion to the failure detector, since the failure detector was also adaptive. It is fed latency information from reads from the nodes, and chooses which node is performing the fastest based on that information. Concessions had to be made though; if it spent too much CPU time calculating which host was best, that would become counterproductive, since it would be sacrificing overall read throughput, which is often CPU-bound. So, it adopted a two-pronged approach, and had two separate operations. One was cheap (receiving the updates) and one was more expensive (calculating the scores for each host.) Both were mitigated; the cheaper update phase had a maximum cap of updates that it would accept, and the calculation operation only ran every so often.
By default, the score calculation is set to a fairly reasonable 100 milliseconds. The updates are capped at a maximum of 10,000 per scoring interval, but this introduces some new problems and questions. First off, if we don't read from a host because we determine it's no longer performing sufficiently, how can we know that it has recovered? Without incoming information to influence this, we have to add a new rule, which (again by default) is to reset all the scores every 10 minutes. This way, we'll have to sample a few reads after every reset interval, but it gives each replica a fair shot once again. Secondly, is 10,000 a good number? We can't really know, since this could very well be dependent on the power of the machine Cassandra is running on. And finally, is latency the only and best source for deciding which machine to read data from?
In the next release of Cassandra, these latter two problems have been addressed. Instead of sampling a fixed amount of updates, we now use a statistically significant random sample, and weight more recent information heavier than past information. But wait, there's more! Instead of relying purely on latency information from reads, we now also consider other factors, like whether or not a node is currently doing a compaction, since that can often penalize reads.
There is one final wrinkle left, however. In the absence of information, the dynamic snitch can't react. The implication of this is that it needs to hit the rpc_timeout in order to get the 'this host is bad now' message, but we don't want to have to wait that long to choose a new replica. How can we respond without anything to respond to? To solve this, we have to examine the broader scope of the situation. In actuality, we dohave a signal we can respond to, and that is time itself. Thus, if a node suddenly becomes a black hole, we'll only throw reads at it for one scoring interval, and when the next score is calculated we'll consider latency, the node's state (called a severity factor) and how long it has been since the node last replied, penalizing it so that we stop trying to read from it (badness_threshold permitting.)
Hopefully this post has explained more thoroughly the available settings for the dynamic snitch, why they exist, as well as giving you a better understanding of how it works in general. Dynamic snitching has long been a technique Cassandra has used to improve read performance, but has never been explained in great detail until now.