Getting Started with Apache Cassandra
If you haven't begun using Apache Cassandra yet and want a little handholding to get started, you're in luck. This article will help you get your feet wet with Cassandra and show you the basics so you'll be ready to start developing Cassandra applications in no time.
Why Cassandra?
Do you need a more flexible data model than what's offered in the relational database world? Would you like to start with a database you know can scale to handle growing numbers of concurrent user connections and growing data volumes, and still run blazingly fast? Do you need a database that has no single point of failure and that can easily distribute data among multiple geographies, data centers, and the cloud? Well, that's Cassandra.
Step 1 – Installing Cassandra
In this article, we'll show you how to kick the tires of Cassandra on a single machine. Note that it's also easy to configure a multi-node, clustered setup, which is what allows Cassandra to really flex its muscles where scale and performance are concerned.
The first step is to download and install Cassandra on your target test machine. Go to www.datastax.com/download and select the DataStax Community Edition, which includes the most up-to-date, stable version of Cassandra (the database server), the Cassandra Query Language (CQL) shell, a free version of DataStax OpsCenter (a web-based management and monitoring solution for Cassandra), and a sample Cassandra application. For this exercise, choose the tarball option for the operating system you're using (either Linux or Mac). For now, don't worry about downloading DataStax OpsCenter, as we'll cover that in another article.
Once your download of Cassandra finishes, move the file to whatever directory you'd like to use for testing Cassandra. Then uncompress the file:
tar -xzf dsc-cassandra-1.2.2-bin.tar.gz
Then switch to the new Cassandra bin directory and start up Cassandra:
robinsmac:dev robin$ cd dsc-cassandra-1.2.2/bin
robinsmac:bin robin$ sudo ./cassandra
robinsmac:bin robin$ INFO 14:49:57,739 Logging initialized
INFO 14:49:57,750 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.6.0_35
INFO 14:49:57,750 Heap size: 2093809664/2093809664
INFO 14:49:57,751 Classpath:
.
.
INFO 14:49:59,208 Completed flushing /var/lib/cassandra/data/system/schema_columns/system-schema_columns-ib-2-Data.db (210 bytes) for commitlog position ReplayPosition(segmentId=1362167398602, position=53130)
Step 2 – Connecting to Cassandra
Now that you have Cassandra running, the next thing to do is connect to the server and begin creating database objects. This is done with the Cassandra Query Language (CQL) utility. CQL is a very SQL-like language that lets you create objects as you’re likely used to doing in the RDBMS world. The CQL utility (cqlsh) is in the same bin directory as the cassandra executable:
robinsmac:bin robin$ ./cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 2.3.0 | Cassandra 1.2.2 | CQL spec 3.0.0 | Thrift protocol 19.35.0]
Use HELP for help.
cqlsh>
Now you're ready to start creating Cassandra keyspaces and data objects. The nice thing about Cassandra is that CQL makes it easy to get started for anyone coming from legacy relational databases (and that's probably you and most everyone you know). Because CQL is so much like SQL, the learning curve is gentle where creating objects, manipulating data, and querying data are concerned.
Step 3 – Creating a Keyspace
Cassandra has the concept of a keyspace, which is similar to a database in an RDBMS. A keyspace holds data objects and is the level at which you specify options for data partitioning and the replication strategy. For this brief introduction, we'll just create a basic keyspace to hold the data objects we'll create:
cqlsh> create keyspace dev
   ... with replication = {'class':'SimpleStrategy','replication_factor':1};
Note that you can have multiple keyspaces in a Cassandra server/cluster, so when you're ready to start creating objects, you need to use the USE command to tell Cassandra which keyspace you want to work with.
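To make the replication options in the keyspace definition a little more concrete, here is a minimal Python sketch of the placement rule usually described for SimpleStrategy: the first replica goes to the node that owns the key's token, and the remaining replicas go to the next nodes around the ring. It is a toy model, not Cassandra's code. The MD5-based token_for function, the node names, and the evenly spaced tokens are all assumptions made for illustration.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a partition key to an integer token (MD5 is only a stand-in here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def simple_strategy_replicas(key, ring, replication_factor):
    """Return the nodes that would hold `key` under a SimpleStrategy-like rule.

    `ring` is a list of (token, node_name) pairs. The first replica is the
    node owning the key's token; the rest are the next nodes clockwise.
    """
    ordered = sorted(ring)
    tokens = [tok for tok, _ in ordered]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ordered)
    # A key cannot have more replicas than there are nodes.
    count = min(replication_factor, len(ordered))
    return [ordered[(start + i) % len(ordered)][1] for i in range(count)]


# Hypothetical three-node ring with evenly spaced tokens.
MAX_TOKEN = 2 ** 128
ring = [(i * MAX_TOKEN // 3, f"node{i + 1}") for i in range(3)]

# Single node, replication_factor 1: the setup used in this article.
single = simple_strategy_replicas("emp:1", [(0, "node1")], 1)
# Three nodes, replication_factor 3: every node holds a copy.
triple = simple_strategy_replicas("emp:1", ring, 3)
print(single)
print(triple)
```

With a single node and a replication factor of 1, as in this article, every row simply lands on that one node.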
Step 4 – Creating Data Objects
Now that you have a keyspace created, it's time to create a data object to store data. Because Cassandra's data model is based on Google Bigtable, you'll use column families (called tables in CQL) to store data. Column families are similar to RDBMS tables, but are much more flexible and dynamic. Column families have rows like RDBMS tables, but they are a sparse column type of object, meaning that rows in a column family can have different columns depending on the data you want to store for a particular row. Let's create a base column family to hold employee data:
cqlsh> use dev;
cqlsh:dev> create table emp (empid int primary key,
       ... emp_first varchar, emp_last varchar, emp_dept varchar);
cqlsh:dev>
The column family is named emp and contains four columns, including the employee ID, which acts as the primary key of the table. Note that a column family must have a primary key, which uniquely identifies each row and is the column you can filter on in a WHERE clause by default.
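The "sparse" nature of a column family can be pictured with a small Python sketch. This is only a conceptual model, assuming a dictionary keyed by primary key. It is not how Cassandra stores data on disk, and the insert and select helpers and the second employee row are hypothetical.

```python
# Toy model of a column family: primary key -> {column name: value}.
# Each row stores only the columns that were actually written.
emp = {}


def insert(table, key, **columns):
    """Write columns for the row identified by `key`, creating it if needed."""
    table.setdefault(key, {}).update(columns)


def select(table, key, column_names):
    """Read one row by primary key; columns never written come back as None."""
    row = table.get(key, {})
    return {name: row.get(name) for name in column_names}


insert(emp, 1, emp_first="fred", emp_last="smith", emp_dept="eng")
insert(emp, 2, emp_first="jane")  # hypothetical row with only one column set

columns = ["emp_first", "emp_last", "emp_dept"]
print(select(emp, 1, columns))
print(select(emp, 2, columns))
```

The second row takes up space for just one column, yet reading it still returns every declared column, with the unset ones empty.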
Step 5 – Inserting and Querying Data
Let's now go ahead and insert data into our new column family using the CQL INSERT command:
cqlsh:dev> insert into emp (empid, emp_first, emp_last, emp_dept)
       ... values (1,'fred','smith','eng');
Notice how Cassandra's CQL INSERT looks just like the RDBMS INSERT command. Other DML statements do as well:
cqlsh:dev> update emp set emp_dept = 'fin' where empid = 1;
Querying data uses the familiar SELECT statement:
cqlsh:dev> select * from emp;
empid | emp_dept | emp_first | emp_last
------+----------+-----------+----------
    1 |      fin |      fred |    smith
However, look what happens when you try to use a WHERE predicate and reference a non-primary key column:
cqlsh:dev> select * from emp where empid = 1;
empid | emp_dept | emp_first | emp_last
------+----------+-----------+----------
    1 |      fin |      fred |    smith
cqlsh:dev> select * from emp where emp_dept = 'fin';
Bad Request: No indexed columns present in by-columns clause with Equal operator
In Cassandra, if you want to query columns other than the primary key, you need to create a secondary index on them:
cqlsh:dev> create index idx_dept on emp(emp_dept);
cqlsh:dev> select * from emp where emp_dept = 'fin';
empid | emp_dept | emp_first | emp_last
------+----------+-----------+----------
    1 |      fin |      fred |    smith
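One way to think about why the CREATE INDEX statement above makes the second query work is the following Python sketch. It is a simplified in-memory model, assuming an index is just a map from column value to primary keys. The ToyTable class and its methods are hypothetical and do not reflect how Cassandra implements secondary indexes internally.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table: rows are found by primary key only,
    unless a secondary index has been built for another column."""

    def __init__(self):
        self.rows = {}     # primary key -> {column: value}
        self.indexes = {}  # indexed column -> {value: set of primary keys}

    def insert(self, key, **columns):
        """Write columns for `key`, keeping any existing indexes in sync."""
        old = self.rows.get(key, {})
        for col, index in self.indexes.items():
            if col in columns:
                if col in old:
                    index[old[col]].discard(key)
                index[columns[col]].add(key)
        self.rows.setdefault(key, {}).update(columns)

    def create_index(self, column):
        """Build a value -> primary keys map over the rows written so far."""
        index = defaultdict(set)
        for key, row in self.rows.items():
            if column in row:
                index[row[column]].add(key)
        self.indexes[column] = index

    def select_where(self, column, value):
        """Find rows by a non-key column; refused when no index exists."""
        if column not in self.indexes:
            raise LookupError(f"no index on column {column!r}")
        keys = sorted(self.indexes[column].get(value, ()))
        return [dict(self.rows[k], empid=k) for k in keys]


emp = ToyTable()
emp.insert(1, emp_first="fred", emp_last="smith", emp_dept="fin")

try:
    emp.select_where("emp_dept", "fin")
except LookupError as err:
    print("rejected:", err)

emp.create_index("emp_dept")
print(emp.select_where("emp_dept", "fin"))
```

The lookup is refused until the index exists, which mirrors the Bad Request error shown earlier; once the index is built, the same query is answered from the value-to-key map.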
There's more you can do with the SELECT command, and for more information, please see the online DataStax CQL reference.
Conclusion
We've reached the end of this short article on how to get started with Cassandra. Hopefully, you now have a basic feel for how to install Cassandra, create objects, manipulate data, and query data. To download either the DataStax Community or Enterprise editions, please visit the DataStax downloads page at www.datastax.com/download.
For More Information
To get a good overview of Cassandra and its architecture, read the Introduction to Apache Cassandra white paper. To learn more about CQL, as well as about setting up a multi-node Cassandra cluster, see the DataStax online documentation for Apache Cassandra 1.2. Also visit the Planet Cassandra blog for more articles, technical blog posts, videos, and more.