Technology | July 10, 2013

QA starter’s guide to Cassandra

Since the beginning of 2013, we've done a lot of hiring for the Test Engineering organization here at DataStax. During onboarding, I've found myself giving the following primer to our new hires, so I thought I would share it with you. For many this will be extremely basic, but for newbies my hope is that you will be able to see tangible behavior behind the concepts explained in the C* documentation. I find the best way to answer my C* questions is to run small, isolated tests and see how the system reacts.

This was tested on my MacBook Pro using the DataStax Community 1.2.6 tarball distribution.

Tips:

  • You can follow along as you read this and replicate the same behavior.
  • Tail the /var/log/cassandra/system.log and pay attention to what happens when you execute each step of this tutorial.
  • Use this as a guide for how to get the most out of C* documentation.

Step 1: Create some data

Reference: CQL3 music playlist example

CREATE KEYSPACE test1 WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
USE test1;

CREATE TABLE playlists (
 id uuid,
 song_order int,
 song_id uuid,
 title text,
 album text,
 artist text,
 PRIMARY KEY (id, song_order) );

INSERT INTO playlists (id, song_order, song_id, title, artist, album)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 1, a3e64f8f-bd44-4f28-b8d9-6938726e34d4, 'La Grange', 'ZZ Top', 'Tres Hombres');

SELECT id, song_order, album, artist, title FROM playlists;
 id                         | song_order | album        | artist   | title
----------------------------+------------+--------------+----------+-----------
 62c36092-....-46196ee77204 |          1 | Tres Hombres |   ZZ Top | La Grange

Step 2: Look at the data directories for test1.playlists

Reference: Cassandra Writes

$ ls /var/lib/cassandra
commitlog 
data
saved_caches

$ ls /var/lib/cassandra/data/test1
playlists

$ ls /var/lib/cassandra/data/test1/playlists

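... No files. Why? The row so far exists only in the commit log and the in-memory memtable; sstable files appear once the memtable is flushed. ...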

Step 3: Flush the data and look at data again

Reference: nodetool documentation

$ bin/nodetool flush test1

$ ls /var/lib/cassandra/data/test1/playlists
test1-playlists-ic-1-CompressionInfo.db 
test1-playlists-ic-1-Filter.db 
test1-playlists-ic-1-Statistics.db 
test1-playlists-ic-1-TOC.txt
test1-playlists-ic-1-Data.db 
test1-playlists-ic-1-Index.db 
test1-playlists-ic-1-Summary.db
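
The file names above follow a regular pattern, and it helps to be able to read them at a glance. Here is a minimal Python sketch that splits a name into its parts. It assumes the five-part, hyphen-separated layout seen in this listing (keyspace, table, format version, generation, component); the `SSTableName` type and `parse_sstable_name` helper are my own illustrative names, not part of Cassandra.

```python
from typing import NamedTuple


class SSTableName(NamedTuple):
    keyspace: str
    table: str
    version: str     # on-disk format version, "ic" throughout this walkthrough
    generation: int  # the number that counts up as new sstables are written
    component: str   # e.g. "Data.db", "Index.db", "TOC.txt"


def parse_sstable_name(filename: str) -> SSTableName:
    """Split a name shaped like the ones listed above into its five parts.

    Assumes exactly five hyphen-separated fields, as in this listing.
    """
    keyspace, table, version, generation, component = filename.split("-")
    return SSTableName(keyspace, table, version, int(generation), component)


print(parse_sstable_name("test1-playlists-ic-1-Data.db"))
```

The generation field is the one to watch in the following steps: every flush or compaction writes files with a new, higher generation.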

Step 4: Use sstable2json to look at the sstable generated

Reference: How Cassandra Stores Data
Reference: sstable2json documentation

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists-ic-1-Data.db
[
{"key": "62c3609282a13a0093d146196ee77204","columns": [["1:","",1373439361510000], 
["1:album","Tres Hombres",1373439361510000], 
["1:artist","ZZ Top",1373439361510000], 
["1:song_id","a3e64f8f-bd44-4f28-b8d9-6938726e34d4",1373439361510000], 
["1:title","La Grange",1373439361510000]]}
]
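
Two things in this output are easier to read once decoded: the row key is the playlist UUID printed as hex without dashes, and the last number in each column is the write timestamp in microseconds since the Unix epoch. The following Python sketch decodes both values copied from the output above; the helper names are my own.

```python
import uuid
from datetime import datetime, timezone

row_key_hex = "62c3609282a13a0093d146196ee77204"
write_timestamp_us = 1373439361510000


def key_to_uuid(key_hex: str) -> uuid.UUID:
    """Turn the dash-less hex row key back into the UUID used in the INSERT."""
    return uuid.UUID(hex=key_hex)


def micros_to_utc(ts_us: int) -> datetime:
    """Convert a microsecond-resolution epoch timestamp to a UTC datetime."""
    seconds, micros = divmod(ts_us, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


print(key_to_uuid(row_key_hex))                       # 62c36092-82a1-3a00-93d1-46196ee77204
print(micros_to_utc(write_timestamp_us).isoformat())  # 2013-07-10T06:56:01.510000+00:00
```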

Step 5: Delete the artist column from the row you inserted

Reference: Cassandra Deletes

DELETE artist FROM playlists 
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204 and song_order = 1;
SELECT id, song_order, album, artist, title FROM playlists;
 id                         | song_order | album        | artist | title
----------------------------+------------+--------------+--------+-----------
 62c36092-....-46196ee77204 |          1 | Tres Hombres |   null | La Grange

Step 6: Run flush to write sstable

$ bin/nodetool flush test1

$ ls /var/lib/cassandra/data/test1/playlists
test1-playlists-ic-1-CompressionInfo.db
test1-playlists-ic-1-Data.db
test1-playlists-ic-1-Filter.db
test1-playlists-ic-1-Index.db
test1-playlists-ic-1-Statistics.db
test1-playlists-ic-1-Summary.db
test1-playlists-ic-1-TOC.txt
test1-playlists-ic-2-CompressionInfo.db
test1-playlists-ic-2-Data.db
test1-playlists-ic-2-Filter.db
test1-playlists-ic-2-Index.db
test1-playlists-ic-2-Statistics.db
test1-playlists-ic-2-Summary.db
test1-playlists-ic-2-TOC.txt

Step 7: Use sstable2json to look at the sstables generated

Note: The version -1 file was not touched since sstables are immutable.

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists-ic-1-Data.db
[
{"key": "62c3609282a13a0093d146196ee77204","columns": [["1:","",1373439361510000], 
["1:album","Tres Hombres",1373439361510000], 
["1:artist","ZZ Top",1373439361510000], 
["1:song_id","a3e64f8f-bd44-4f28-b8d9-6938726e34d4",1373439361510000], 
["1:title","La Grange",1373439361510000]]}
]

Note: -2 files were created to reflect the deleted column.

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists-ic-2-Data.db
[
{"key": "62c3609282a13a0093d146196ee77204","columns": [["1:artist","51dd0bc4",1373440964374000,"d"]]}
]
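
The value "51dd0bc4" in the tombstone is not column data. As far as I can tell from this output, it is the deletion time in whole seconds since the epoch, printed as hex; it matches the microsecond timestamp next to it once that is truncated to seconds. A small Python sketch to check this, with a helper name of my own:

```python
from datetime import datetime, timezone


def tombstone_deleted_at(value_hex: str) -> datetime:
    """Interpret a tombstone's hex value as epoch seconds (assumption noted above)."""
    return datetime.fromtimestamp(int(value_hex, 16), tz=timezone.utc)


# The tombstone entry copied from the sstable2json output above.
tombstone = ["1:artist", "51dd0bc4", 1373440964374000, "d"]

deletion_seconds = int(tombstone[1], 16)
print(deletion_seconds)                             # 1373440964
print(deletion_seconds == tombstone[2] // 1_000_000)  # True
print(tombstone_deleted_at(tombstone[1]).isoformat())  # 2013-07-10T07:22:44+00:00
```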

Step 8: Compact the data and see what happens

Note: File versions 1 and 2 were merged to a version 3 file.

$ bin/nodetool compact test1

$ ls /var/lib/cassandra/data/test1/playlists
test1-playlists-ic-3-CompressionInfo.db
test1-playlists-ic-3-Data.db
test1-playlists-ic-3-Filter.db
test1-playlists-ic-3-Index.db
test1-playlists-ic-3-Statistics.db
test1-playlists-ic-3-Summary.db
test1-playlists-ic-3-TOC.txt

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists-ic-3-Data.db
[
{"key": "62c3609282a13a0093d146196ee77204","columns": [["1:","",1373440786217000], 
["1:album","Tres Hombres",1373440786217000], 
["1:artist","51dd0bc4",1373440964374000,"d"], 
["1:song_id","a3e64f8f-bd44-4f28-b8d9-6938726e34d4",1373440786217000], 
["1:title","La Grange",1373440786217000]]}
]
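
To make the merge concrete, here is a simplified Python sketch of the idea: for each column name, keep the entry with the highest timestamp. It is only a model of what compaction appears to do here, not Cassandra's implementation; it ignores details such as tie-breaking and the real column comparator, and `merge_columns` is a name I made up. The inputs are the two sstables dumped in Step 7.

```python
def merge_columns(*sstables):
    """Keep, for each column name, the entry with the highest timestamp."""
    merged = {}
    for columns in sstables:
        for entry in columns:
            name, timestamp = entry[0], entry[2]
            current = merged.get(name)
            if current is None or timestamp > current[2]:
                merged[name] = entry
    # Plain string ordering is enough for these column names.
    return [merged[name] for name in sorted(merged)]


sstable_1 = [
    ["1:", "", 1373439361510000],
    ["1:album", "Tres Hombres", 1373439361510000],
    ["1:artist", "ZZ Top", 1373439361510000],
    ["1:song_id", "a3e64f8f-bd44-4f28-b8d9-6938726e34d4", 1373439361510000],
    ["1:title", "La Grange", 1373439361510000],
]
sstable_2 = [["1:artist", "51dd0bc4", 1373440964374000, "d"]]

for entry in merge_columns(sstable_1, sstable_2):
    print(entry)
```

The newer tombstone replaces the "ZZ Top" value for 1:artist, while the other four columns pass through unchanged, which is the shape of the version 3 file above.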

Step 9: Delete the row, flush the data and look at data again

DELETE FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204 AND song_order = 1;
SELECT * FROM playlists;
[no rows returned]

Note: Now there are file versions -3 and -4.

$ bin/nodetool flush test1

$ ls /var/lib/cassandra/data/test1/playlists
test1-playlists-ic-3-CompressionInfo.db
test1-playlists-ic-3-Data.db
test1-playlists-ic-3-Filter.db
test1-playlists-ic-3-Index.db
test1-playlists-ic-3-Statistics.db
test1-playlists-ic-3-Summary.db
test1-playlists-ic-3-TOC.txt
test1-playlists-ic-4-CompressionInfo.db
test1-playlists-ic-4-Data.db
test1-playlists-ic-4-Filter.db
test1-playlists-ic-4-Index.db
test1-playlists-ic-4-Statistics.db
test1-playlists-ic-4-Summary.db
test1-playlists-ic-4-TOC.txt

Note: The contents of the new version -4 file reflect the deleted row.

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists-ic-4-Data.db
[
{"key": "62c3609282a13a0093d146196ee77204","columns": [["1","1:!",1373441493485000,"t",1373441493]]}
]
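
By now we have seen three shapes of entry in sstable2json output: plain columns, entries flagged "d", and this one flagged "t". The Python sketch below labels each shape. The flag meanings are my reading of the output in this walkthrough ("d" for a deleted column, "t" for a tombstone covering a range of column names); other flags that sstable2json may emit are not handled, and `describe_entry` is an illustrative helper of my own.

```python
def describe_entry(entry):
    """Label an sstable2json column entry based on its optional fourth field."""
    name, value = entry[0], entry[1]
    flag = entry[3] if len(entry) > 3 else None
    if flag == "d":
        return f"column tombstone for {name!r}"
    if flag == "t":
        # For this shape the first two fields look like the start and end of the range.
        return f"range tombstone from {name!r} to {value!r}"
    return f"live column {name!r} = {value!r}"


print(describe_entry(["1:title", "La Grange", 1373439361510000]))
print(describe_entry(["1:artist", "51dd0bc4", 1373440964374000, "d"]))
print(describe_entry(["1", "1:!", 1373441493485000, "t", 1373441493]))
```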

Step 10: Compact the data and see what happens

Note: Now we have file version -5 after we compact the sstables.

$ bin/nodetool compact test1

$ ls /var/lib/cassandra/data/test1/playlists
test1-playlists-ic-5-CompressionInfo.db
test1-playlists-ic-5-Data.db
test1-playlists-ic-5-Filter.db
test1-playlists-ic-5-Index.db
test1-playlists-ic-5-Statistics.db
test1-playlists-ic-5-Summary.db
test1-playlists-ic-5-TOC.txt

Note: Contents of the new file version -5 look like the -4 file.

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists-ic-5-Data.db
[
{"key": "62c3609282a13a0093d146196ee77204","columns": [["1","1:!",1373441493485000,"t",1373441493]]}
]

Step 11: Change gc_grace_seconds so that we may remove the tombstone

Reference: Deletes

The purpose of this section is to highlight gc_grace_seconds. Newer versions of Cassandra filter out range ghosts (a row key with no columns), so you won't see tombstone records in query results after a delete.
One of my favorite blog posts is related to tombstones and data modeling: Cassandra anti-patterns: Queues and queue-like datasets

USE test1;
ALTER TABLE playlists WITH gc_grace_seconds = 1;

Note: After compaction all files are gone because we removed the tombstones.

$ bin/nodetool compact test1

$ ls /var/lib/cassandra/data/test1/playlists
... no files found ....

WARNING: Never set gc_grace_seconds this low in production. If a node is down while tombstones are removed, previously deleted data may reappear via repair.
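
The rule at work can be sketched in a few lines of Python: a tombstone becomes a candidate for removal during compaction only once its deletion time plus gc_grace_seconds lies in the past. This is a simplified model under that assumption; real compaction applies further conditions that are not shown here. The function name is my own, and 864000 seconds (10 days) is the default gc_grace_seconds.

```python
DEFAULT_GC_GRACE_SECONDS = 864000  # 10 days, the default for a table


def is_purgeable(local_deletion_time: int, gc_grace_seconds: int, now: int) -> bool:
    """True once the grace period after the deletion has fully elapsed.

    All arguments are in seconds since the Unix epoch (or plain seconds
    for gc_grace_seconds).
    """
    return local_deletion_time + gc_grace_seconds < now


# Deletion time of the row tombstone from Step 9, checked one minute later.
deleted_at = 1373441493
one_minute_later = deleted_at + 60

print(is_purgeable(deleted_at, 1, one_minute_later))                         # True
print(is_purgeable(deleted_at, DEFAULT_GC_GRACE_SECONDS, one_minute_later))  # False
```

With the default setting the tombstone would stay on disk for ten days, which gives a node that was down time to learn about the delete.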

[Figure: GC Grace]

[Figure: GC Grace Without Tombstones]


Step 12: Create an index, flush the data and look at data again

Generate another row of data in cqlsh:

INSERT INTO playlists (id, song_order, song_id, title, artist, album)
  VALUES (72c36092-82a1-3a00-93d1-46196ee77204, 1, c7e64f8f-bd44-4f28-b8d9-6938726e34d4, 
  'Brews', 'Branford Marsalis', 'Four MFs Playing Tunes');

CREATE INDEX ON playlists(artist);

SELECT song_order, album, artist, title 
FROM playlists 
WHERE artist = 'Branford Marsalis';
 song_order | album                  | artist            | title
------------+------------------------+-------------------+-------
          1 | Four MFs Playing Tunes | Branford Marsalis | Brews

Note: After a flush (bin/nodetool flush test1), there are now *_idx-* files in the data directory.

$ ls /var/lib/cassandra/data/test1/playlists
test1-playlists-ic-7-CompressionInfo.db
test1-playlists-ic-7-Data.db
test1-playlists-ic-7-Filter.db
test1-playlists-ic-7-Index.db
test1-playlists-ic-7-Statistics.db
test1-playlists-ic-7-Summary.db
test1-playlists-ic-7-TOC.txt
test1-playlists.playlists_artist_idx-ic-1-CompressionInfo.db
test1-playlists.playlists_artist_idx-ic-1-Data.db
test1-playlists.playlists_artist_idx-ic-1-Filter.db
test1-playlists.playlists_artist_idx-ic-1-Index.db
test1-playlists.playlists_artist_idx-ic-1-Statistics.db
test1-playlists.playlists_artist_idx-ic-1-Summary.db
test1-playlists.playlists_artist_idx-ic-1-TOC.txt

Look at the secondary index -Data file:

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists.playlists_artist_idx-ic-1-Data.db
[
{"key": "4272616e666f7264204d617273616c6973","columns": [["72c3609282a13a0093d146196ee77204:1","",1373442382363000]]}
]
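
The index row is easier to understand once decoded: its key is the indexed value as hex, and its column name points back to the row in the base table. Here is a short Python sketch using the values from the output above. I am assuming the column name has the form "<partition key hex>:<song_order>", which is what this output suggests; `decode_index_row` is an illustrative name.

```python
import uuid


def decode_index_row(key_hex: str, column_name: str):
    """Return (indexed value, base-table partition key, song_order)."""
    indexed_value = bytes.fromhex(key_hex).decode("utf-8")
    partition_hex, clustering = column_name.split(":", 1)
    return indexed_value, uuid.UUID(hex=partition_hex), int(clustering)


artist, playlist_id, song_order = decode_index_row(
    "4272616e666f7264204d617273616c6973",
    "72c3609282a13a0093d146196ee77204:1",
)
print(artist)       # Branford Marsalis
print(playlist_id)  # 72c36092-82a1-3a00-93d1-46196ee77204
print(song_order)   # 1
```

In other words, the index is itself a small table keyed by artist, whose columns list the playlist rows that hold that artist.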

Look at the secondary index -Index file:

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists.playlists_artist_idx-ic-1-Index.db
[
{"key": "4272616e666f7264204d617273616c6973","columns": [["72c3609282a13a0093d146196ee77204:1","",1373442382363000]]}
]

Look at the data file:

$ bin/sstable2json /var/lib/cassandra/data/test1/playlists/test1-playlists-ic-7-Data.db
[
{"key": "72c3609282a13a0093d146196ee77204","columns": [["1:","",1373442382363000], 
["1:album","Four MFs Playing Tunes",1373442382363000], 
["1:artist","Branford Marsalis",1373442382363000], 
["1:song_id","c7e64f8f-bd44-4f28-b8d9-6938726e34d4",1373442382363000], 
["1:title","Brews",1373442382363000]]}
]

Step 13: Drop the Keyspace - notice we generate a snapshot and leave the directory in place

Note: If you drop a keyspace and then recreate the same keyspace and column family, you may notice your data come back. You may want to truncate the table first if you really want to be squeaky clean.

drop keyspace test1;

Check data directory:

$ ls /var/lib/cassandra/data/test1/playlists
snapshots

It's OK to remove the directory and the snapshot. Restart the server and see for yourself. :)

$ rm -rf /var/lib/cassandra/data/test1/playlists

Step 14: Change memtable_total_space_in_mb to force flushing of memtables

We will use cassandra-stress for this example, since it makes it very easy to create a single large column that triggers an automatic flush.
Reference: Cassandra Operations

Change cassandra.yaml:

memtable_total_space_in_mb: 1

WARNING: Never set this value so low in practice; it is only meant for illustration.

Run stress with 1MB column size:

$ tools/bin/cassandra-stress -n 1 -S 1048576
Created keyspaces. Sleeping 1s for propagation.
total,interval_op_rate,interval_key_rate,latency/95th/99th,elapsed_time
1,0,0,39.6,39.6,39.6,0
END

$ ls /var/lib/cassandra/data/Keyspace1/Standard1 
Keyspace1-Standard1-ic-1-Data.db
Keyspace1-Standard1-ic-1-Digest.sha1
Keyspace1-Standard1-ic-1-Filter.db
Keyspace1-Standard1-ic-1-Index.db
Keyspace1-Standard1-ic-1-Statistics.db
Keyspace1-Standard1-ic-1-Summary.db
Keyspace1-Standard1-ic-1-TOC.txt

Note: This file is huge, so run sstablekeys to get a list of keys in the file instead of sstable2json.

$ bin/sstablekeys /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-ic-1-Data.db
30
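
The "30" printed by sstablekeys is the row key in hex, like the keys we saw in the sstable2json output. A tiny Python sketch to decode it, assuming the stress tool writes its keys as ASCII text, and to confirm that the -S value used above is exactly one mebibyte:

```python
def decode_key(key_hex: str) -> str:
    """Decode a hex-printed row key, assuming it holds ASCII text."""
    return bytes.fromhex(key_hex).decode("ascii")


print(decode_key("30"))         # 0
print(1048576 == 1024 * 1024)   # True
```

So the single row written with -n 1 has the key "0", and its one column is as large as the whole memtable budget we configured, which is why the flush happened without running nodetool.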

Summary

I hope this was helpful for those new to Cassandra, and provided a small tour of key concepts. This was illustrated using a single row of data, but imagine how dynamic the system becomes under heavy writes that generate many sstables and trigger lots of compaction activity.
