April 17, 2019

Simplifying Installing Apache Cassandra, Apache Spark and Jupyter

So you want to experiment with Apache Cassandra and Apache Spark to do some Machine Learning, awesome! But there is one downside: you need to create a cluster, or ask to borrow someone else's, to be able to do your experimentation… but what if I told you there is a way to install everything you need on one node, even on your laptop (if you are using Linux or Mac!)? The steps outlined below will install:

  • Apache Cassandra
  • Apache Spark
  • Apache Cassandra - Apache Spark Connector
  • PySpark
  • Jupyter Notebooks
  • Cassandra Python Driver

Note: As with any set of install instructions, these may not work in every case. Each environment is different. Hopefully this works for you (as it did for me!), but if not, use this as a guide. Also, feel free to reach out and add comments on what worked for you!


Installing Apache Cassandra


Download bits

http://cassandra.apache.org/download/

Untar and Start

http://cassandra.apache.org/doc/latest/getting_started/installing.html

tar -xzvf apache-cassandra-x.x.x-bin.tar.gz

./apache-cassandra-x.x.x/bin/cassandra  # This will start Cassandra

You might want to add ./apache-cassandra-x.x.x/bin to your PATH, but this is not required.

We are using all defaults in this case. For more information about non-default configurations, review the Apache Cassandra documentation.
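If you want to verify the node came up, a quick check (assuming the tarball layout above) is ./apache-cassandra-x.x.x/bin/nodetool status; look for UN (Up/Normal) next to your node.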

Create a Keyspace and Table with CQLSH

We will use this keyspace and table later to validate the connection between Apache Cassandra and Apache Spark.

./apache-cassandra-x.x.x/bin/cqlsh

CREATE KEYSPACE IF NOT EXISTS test
   WITH REPLICATION =
   { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE test;

CREATE TABLE IF NOT EXISTS testing123 (id int, name text, city text, PRIMARY KEY (id));

INSERT INTO testing123 (id, name, city) VALUES (1, 'Amanda', 'Bay Area');
INSERT INTO testing123 (id, name, city) VALUES (2, 'Toby', 'NYC');
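A quick SELECT * FROM test.testing123; in cqlsh should return both rows before you move on.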


Install Apache Spark in Standalone Mode


Download bits:

https://spark.apache.org/downloads.html

Install:

https://spark.apache.org/docs/latest/spark-standalone.html#installing-spark-standalone-to-a-cluster

tar -xzvf spark-x.x.x-bin-hadoopx.x.tgz

Before starting Spark do the following:

export SPARK_HOME=/path/to/spark-x.x.x-bin-hadoopx.x
cd $SPARK_HOME/conf

cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
# Add the following line (note: config keys are case sensitive):
spark.jars.packages     com.datastax.spark:spark-cassandra-connector_2.11:2.3.2
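Alternatively, instead of editing spark-defaults.conf, you can pass the connector at launch time with Spark's standard --packages flag, e.g. ./bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2.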


Start Spark In Standalone Mode

cd $SPARK_HOME
./sbin/start-master.sh

Information about the Apache Spark Connector

The Apache Cassandra and Apache Spark connector moves data back and forth between Apache Cassandra and Apache Spark, so you can use the power of Apache Spark on your data. In this setup the connector is co-located with Apache Cassandra and Apache Spark on the same node. The connector gathers data from Apache Cassandra, using its knowledge of the token ranges, and pages it into the Spark executors. Under the hood, the connector uses the DataStax Java driver to move data between Apache Cassandra and Apache Spark. More information can be found here: https://databricks.com/session/spark-and-cassandra-2-fast-2-furious

(Diagram: the connector reads from Cassandra by querying with token bounds.)

Note: We are just working with PySpark in this case, where only DataFrames are available (not RDDs).

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15...

https://spark-packages.org/package/datastax/spark-cassandra-connector


Test the connection out first, using the keyspace and table we created above:

$SPARK_HOME/bin/pyspark

# Create a DataFrame from the table we created above
spark.read.format('org.apache.spark.sql.cassandra').options(table='testing123', keyspace='test').load().show()
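The connector moves data in both directions, so you can also write a DataFrame back to Apache Cassandra. A minimal sketch, run in the same pyspark shell (the row values here are made up for illustration):

# Append a new (made-up) row to test.testing123 via the connector
new_rows = spark.createDataFrame([(3, 'Sam', 'Austin')], ['id', 'name', 'city'])
new_rows.write.format('org.apache.spark.sql.cassandra').options(table='testing123', keyspace='test').mode('append').save()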


Install Jupyter Notebooks with pip

Reference: https://jupyter.org/install

python -m pip install --upgrade pip

python -m pip install jupyter

Start Jupyter with PySpark

cd $SPARK_HOME
export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark

These commands will launch Jupyter Notebook on localhost:8888. The downside is that if you have existing notebooks stored elsewhere, you won't be able to navigate to them from here... just copy them into this directory. Not the best solution, but it will do to be able to use all these pieces together!
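Another workaround is to point Jupyter at your existing notebook directory when launching, using Jupyter Notebook's standard --notebook-dir option (the path below is just a placeholder):

export PYSPARK_DRIVER_PYTHON_OPTS='notebook --notebook-dir=/path/to/your/notebooks'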


Install Apache Cassandra Python Driver

pip install cassandra-driver

Create a New Notebook

Import Packages

import cassandra
import pyspark
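Optionally, confirm the kernel sees both packages by printing their versions (both the driver and PySpark expose __version__):

print(cassandra.__version__)
print(pyspark.__version__)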

Connect to Cluster

from cassandra.cluster import Cluster
cluster = Cluster(['127.0.0.1'])

session = cluster.connect()
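To confirm the session is really talking to Cassandra, you could read back the rows we inserted with cqlsh earlier; a minimal sketch:

# Read back the two rows inserted via cqlsh
rows = session.execute('SELECT id, name, city FROM test.testing123')
for row in rows:
    print(row.id, row.name, row.city)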

Create a SparkSession and load the DataFrame from the Apache Cassandra table. Verify the transfer has occurred by printing the number of rows in the DataFrame; we should see "2".

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('demo').master("local").getOrCreate()
df = spark.read.format("org.apache.spark.sql.cassandra").options(table="testing123", keyspace="test").load()
print("Table Row Count: ")
print(df.count())
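Since df is a regular Spark DataFrame, any DataFrame operation works from here; for example, filtering or aggregating the two rows above:

# A couple of standard DataFrame operations on the Cassandra-backed data
df.filter(df.city == 'NYC').show()
df.groupBy('city').count().show()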


Conclusion

TADA!

There you have it! You now have Apache Cassandra, Apache Spark, the Apache Cassandra - Apache Spark connector, PySpark, the Cassandra Python driver, and Jupyter all installed on one node (or local instance!). Congratulations! Enjoy exploring your data!

A few notebooks you might enjoy!

https://github.com/amandamoran/wineAndChocolate

https://github.com/amandamoran/pydata

Reference: https://medium.com/explore-artificial-intelligence/downloading-spark-and...
