Handling Disk Failures In Cassandra 1.2

Cassandra is great at handling entire node failures. It's not just robust, it's almost indestructible.

But until Cassandra 1.2, a single unavailable disk has the potential to make the whole replica unresponsive, while still technically alive and part of the cluster: memtables will be unable to flush and the node will eventually run out of memory. Commitlog append may also fail if you happen to lose the commitlog disk.

The traditional workaround has been to deploy on raid10 volumes, but as Cassandra handles increasingly large data volumes the prospect of paying an extra 50% space penalty on top of Cassandra's own replication is becoming unpalatable.

The upcoming Cassandra 1.2 release (currently in beta) fixes both of these issues by introducing a disk_failure_policy setting that allows you to choose from two policies that deal with disk failure sensibly: best_effort and stop. Here is how these work:

stop is the default behavior for new 1.2 installations. Upon encountering a file system error Cassandra will shut down gossip and Thrift services, leaving the node effectively dead, but still inspectable via JMX for troubleshooting.

best_effort Cassandra will do its best in the face of disk errors: if it can't write to a disk, the disk will become blacklisted for writes and the node will continue writing elsewhere; if Cassandra can't read from a disk, it will be marked as unreadable, and the node will continue serving data from readable sstables only. This implies that it's possible for stale data to be served when the most recent version was on the disk that is no longer accessible and consistency level is ONE, so choose this option with care. This allows you to get the most out of your disks.

An ignore policy also exists for upgrading users. In this mode Cassandra will behave in the exact same manner as 1.1 and older versions did - all file system errors will logged but otherwise ignored. DataStax recommends users opt in to stop or best_effort instead.

Summary

Starting with version 1.2, Cassandra will be able to properly react to a disk failure - either by stopping the affected node or by blacklisting the failed drive, depending on your availability/consistency requirements. This allows deploying Cassandra nodes with large disk arrays without the overhead of raid10.