Apache Cassandra and Time Series: BFFs?
One of the most common use cases for Apache Cassandra® is time series. After introducing the concept of time series in a few words, Amanda and Cedrick analyze why Cassandra has gained so much traction for it and detail what we see at customers, what the pitfalls are, and what today’s challenges are.
Highlights!
Amanda: Welcome to this episode of the Distributed Data Show. I am Amanda, and this is my coworker Cedrick, all the way from Paris! This is our first DDS episode together! Hopefully, next time we film I will get to go to Paris!
Cedrick: Hi, I’m so glad to be here. By the way, we are recording this episode from sunny Florida today. That’s awesome!
Amanda: I know! Today we wanted to discuss time series. Cedrick, with all his other duties as a Developer Advocate, has also been working a lot with our customers on different time series applications. So maybe first, could you remind us what time series are?
Cedrick: Sure. Simply put, it is a sequence of numerical data points, values in successive order, where the order is time. Most natural phenomena can be described that way: choose one source and measure its values over time.
Amanda: I can think of use cases in today’s IT: logs, stocks, sensors, events. All of those are time series. OK, but why Cassandra?
Cedrick: Amanda, if you met someone at an event who asked you to explain Apache Cassandra in one minute, what would you say? Personally, I say it is a distributed database, which means multiple nodes. Each node holds about 1 TB of data and handles about 3K transactions per second per CPU. If you need more capacity, add nodes; if you need more throughput, add nodes. And I think this is the key point here, more than the data replication for resiliency. Need more throughput? Tunable consistency can also help: use CL=ONE and you speed things up.
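To make the "add nodes" rule of thumb concrete, here is a rough back-of-the-envelope sketch. The 1 TB per node and 3K transactions per second per CPU figures come from the conversation; the number of CPUs per node, the replication factor, and the function name `nodes_needed` are assumptions made purely for illustration, not sizing guidance.

```python
import math

def nodes_needed(data_tb, writes_per_sec, tb_per_node=1.0,
                 tx_per_sec_per_cpu=3000, cpus_per_node=8,
                 replication_factor=3):
    """Rough node count: the larger of the storage-bound and
    throughput-bound estimates (hypothetical helper)."""
    # Each write is stored on `replication_factor` replicas, so both the
    # stored volume and the write load are multiplied by it.
    for_storage = math.ceil(data_tb * replication_factor / tb_per_node)
    for_throughput = math.ceil(
        writes_per_sec * replication_factor
        / (tx_per_sec_per_cpu * cpus_per_node)
    )
    # A cluster needs at least as many nodes as replicas.
    return max(for_storage, for_throughput, replication_factor)

storage_bound = nodes_needed(data_tb=10, writes_per_sec=100_000)
throughput_bound = nodes_needed(data_tb=1, writes_per_sec=200_000)
```

In the first call the storage estimate dominates; in the second the write rate does. Either way the answer to "I need more" is the same: add nodes.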
Amanda: Throughput is key. The number of events is increasing exponentially, and 5G is coming. So with Cassandra we can easily write the data at a good pace. What about reading the data, then? I would like to be able to draw graphs and charts, run aggregations, and show trends, with both coarse- and fine-grained charts.
Part II - Data Modelling
Cedrick: Haha, the success of a Cassandra project is all about data modelling. When you chart a given stock or a given sensor, you read multiple data points for a single entity, so the entity identifier is a good candidate for the partition key. The partitions will be evenly distributed across the cluster, and you read a single partition to draw the graph.
Amanda: Wait. If I don’t have a lot of sensors, I will always hit the same nodes (hot spots). And if I have a lot of data, I will hit the partition size limits (about 100 MB as the recommended maximum, 2 billion cells as the hard limit).
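Amanda's concern can be checked with simple arithmetic. The sketch below estimates how quickly a single-sensor partition reaches the 100 MB mark. The 50 bytes per data point is an assumed average chosen for illustration (real sizes depend on the schema and on compression), and `days_until_partition_limit` is a hypothetical helper.

```python
def days_until_partition_limit(points_per_sec, bytes_per_point=50,
                               limit_bytes=100 * 1024**2):
    """Days before one partition grows past `limit_bytes`,
    assuming a constant write rate and point size."""
    bytes_per_day = points_per_sec * 86_400 * bytes_per_point
    return limit_bytes / bytes_per_day

one_hz = days_until_partition_limit(points_per_sec=1)
ten_hz = days_until_partition_limit(points_per_sec=10)
```

Under these assumptions a sensor reporting once per second fills a partition in a little over three weeks, and one reporting ten times per second does so in a few days, so a partition keyed on the sensor identifier alone cannot grow forever.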
Cedrick:
- Efficient Storage Model
- Bucketing (Stored on disk in sequential format)
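Bucketing, as mentioned above, usually means adding a time component to the partition key so that each partition holds a bounded slice of one entity's data. Here is a minimal sketch of how an application might derive such a key. The function names, the bucket granularities, and the string formats are all assumptions for illustration, and timestamps are expected to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical bucket formats; choose the granularity so that one bucket
# stays comfortably below the partition size guideline.
BUCKET_FORMATS = {"hour": "%Y-%m-%dT%H", "day": "%Y-%m-%d", "month": "%Y-%m"}

def partition_key(sensor_id, ts, bucket="day"):
    """Composite partition key: (entity identifier, time bucket in UTC)."""
    return (sensor_id, ts.astimezone(timezone.utc).strftime(BUCKET_FORMATS[bucket]))

def day_buckets(start, end):
    """Day buckets a reader must query to cover the range [start, end]."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets

ts = datetime(2020, 1, 15, 23, 30, tzinfo=timezone.utc)
key = partition_key("sensor-42", ts)
```

The trade-off is on the read side: a chart spanning several buckets needs one query per bucket, which is why the bucket size is typically matched to the most common chart range.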
Amanda: Good. But what about aggregation? I said before that I would like both fine- and coarse-grained charts. That is a lot of computation to do for the aggregations.
Cedrick:
- Data duplication with different steps
- Rollups
- Data compression (LZ4)
- TTL
- Some queries for warm data
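The rollup idea in the list above can be sketched as precomputing aggregates at coarser steps and storing each resolution separately, with a shorter TTL for the finer data. The code below is a minimal illustration under stated assumptions: points are `(epoch_seconds, value)` pairs with integer timestamps, and the step sizes and TTL values are hypothetical, not recommendations.

```python
from collections import defaultdict

def rollup(points, step_seconds):
    """Aggregate (epoch_seconds, value) points into fixed-width windows."""
    windows = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its window.
        windows[ts - ts % step_seconds].append(value)
    return [
        {"start": start, "min": min(vals), "max": max(vals),
         "avg": sum(vals) / len(vals), "count": len(vals)}
        for start, vals in sorted(windows.items())
    ]

# Hypothetical retention policy: raw data expires quickly, rollups live longer.
TTL_SECONDS_BY_STEP = {
    1: 7 * 86_400,           # raw, one-second points: 7 days
    60: 90 * 86_400,         # one-minute rollups: 90 days
    3600: 2 * 365 * 86_400,  # one-hour rollups: 2 years
}

raw = [(0, 1.0), (30, 3.0), (60, 5.0)]
per_minute = rollup(raw, 60)
```

A coarse chart then reads a handful of precomputed rows instead of scanning and aggregating raw points at query time, at the cost of duplicating data across resolutions.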