What is data persistence & why does it matter?
Understanding the meaning of persistence is important for evaluating different data store systems. Given the importance of the data store in most modern applications, making a poorly informed choice could mean substantial downtime or loss of data. In this post, we'll discuss persistence and data store design approaches and provide some background on these in the context of Cassandra.
If you’re interested in learning more about persistence in Cassandra and other NoSQL databases, check out our complete guide to NoSQL.
What is data persistence?
Persistence is "the continuance of an effect after its cause is removed". In the context of storing data in a computer system, this means that the data survives after the process with which it was created has ended. In other words, for a data store to be considered persistent, it must write to non-volatile storage.
Which data stores provide persistence?
If you need persistence in your data store, then you need to also understand the four main design approaches that a data store can take and how (or if) these designs provide persistence:
- Pure in-memory, no persistence at all, such as memcached or Scalaris
- In-memory with periodic snapshots, such as Oracle Coherence or Redis
- Disk-based with update-in-place writes, such as MySQL ISAM or MongoDB
- Commitlog-based, such as all traditional OLTP databases (Oracle, SQL Server, etc.)
In-memory approaches can achieve blazing speed, but at the cost of being limited to a relatively small data set. Most workloads have relatively small "hot" (active) subset of their total data; systems that require the whole dataset to fit in memory rather than just the active part are fine for caches but a bad fit for most other applications. Because the data is in memory only, it will not survive process termination. Therefore these types of data stores are not considered persistent.
Adding persistence to systems
The easiest way to add persistence to an in-memory system is with periodic snapshots to disk at a configurable interval. Thus, you can lose up to that interval's worth of updates.
Update-in-place and commitlog-based systems store to non-volatile memory immediately, but only commitlog-based persistence provides Durability -- the D in ACID -- with every write persisted before success is returned to the client.
Cassandra implements a commit-log based persistence design, but at the same time provides for tunable levels of durability. This allows you to decide what the right trade off is between safety and performance. You can choose, for each write operation, to wait for that update to be:
- buffered to memory
- written to disk on a single machine
- written to disk on multiple machines
- written to disk on multiple machines in different data centers
Or, you can choose to accept writes as quickly as possible, acknowledging their receipt immediately before they have even been fully deserialized from the network.
Why data persistence matters
At the end of the day, you're the only one who knows what the right performance/durability trade off is for your data. Making an informed decision on data store technologies is critical to addressing this tradeoff on your terms. Because Cassandra provides such tunability, it is a logical choice for systems with a need for a durable, performant data store.