October 31, 2018

When Rotten Tomatoes Isn't Enough: Twitter Sentiment Analysis with DSE Part 2

Part 2 of this blog series will focus on how to get DataStax Enterprise Analytics with Apache Cassandra™ and Apache Spark™, Jupyter Notebooks, and all the required Python package dependencies set up via Docker.

What Problem Are We Trying to Solve?

The question of our time: "What movie should I actually see?" Wouldn't it be great if you could ask 1 million people this question? Wouldn't it be great if you could automate this process? And wouldn't it be great if you didn't have to do all the installation steps detailed in Part 1 of this series?

Data analytics doesn't have to be complicated and neither does the setup!

To do this we can use the power of Big Data and a combination of technologies: DataStax Enterprise Analytics with Apache Spark™ and Apache Cassandra™, Apache Spark™ Machine Learning Libraries, Python, Pyspark, Twitter Tweets, the Twitter Developer API, Jupyter notebooks, Pandas, the Python package Pattern, and Docker!

How Are We Going to Solve It?

In Part 1 of this series we installed everything locally from the DSE binary tar file. Now we will simplify the process by using Docker and a previously created image.

How to Get Started

Requirements

  • Docker
  • Download or clone this repo: https://github.com/amandamoran/pydata
    • Note: This repo also includes notebooks that use CSV files, in case you would like to get started with the notebook but do not wish to create a Twitter Dev API account.

Overview

  • Install Docker
  • Download DataStax Docker Image
  • Open Jupyter

Install Docker

 

Download Docker

Configure Docker Memory Settings

  • Allow for 5 GB of Memory per container
  • Docker -> Preferences -> Advanced -> Memory

 

Download DSE/Jupyter Images

  • cd YourDownloadPath/pydata
  • docker-compose up -d 
    • This will take about 6 minutes (depending on your connection speed)
    • This will start DataStax Enterprise, which includes Apache Spark™, along with Jupyter notebooks
    • You must run this command from the directory that contains the docker-compose.yaml file (this file holds all the configuration and information on how to download and deploy these containers)
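
The steps above can also be scripted. The following is a minimal Python sketch, not part of the repo: the helper names `compose_up_command` and `start_containers` are hypothetical, and it assumes `docker-compose` is on your PATH. It checks that the docker-compose.yaml file is present in the directory before running the same command from that directory.

```python
import subprocess
from pathlib import Path

# File name as given in this post; the helpers below are illustrative only.
COMPOSE_FILENAME = "docker-compose.yaml"


def compose_up_command(project_dir):
    """Return the command to run, after checking the compose file is present."""
    compose_file = Path(project_dir) / COMPOSE_FILENAME
    if not compose_file.is_file():
        raise FileNotFoundError(f"{COMPOSE_FILENAME} not found in {project_dir}")
    return ["docker-compose", "up", "-d"]


def start_containers(project_dir):
    """Run docker-compose from inside project_dir (Docker must be running)."""
    command = compose_up_command(project_dir)
    # cwd makes the command run in the directory that holds the compose file.
    subprocess.run(command, cwd=project_dir, check=True)
```

Calling `start_containers("YourDownloadPath/pydata")` would be the scripted equivalent of the two bullet commands, with an early error if you point it at the wrong directory.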

Open Jupyter

  • Wait until the download and startup are complete
  • Log in with the token printed in the Jupyter container logs, which you can view with:
    • docker logs pydata_jupyter_1
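
If you would rather not scan the logs by eye, the token can be pulled out with a short script. This is a sketch under assumptions: it assumes the Jupyter container prints a login URL containing `?token=...`, which is the usual Jupyter behavior but may differ for this image, and the function names `extract_token` and `jupyter_token` are hypothetical.

```python
import re
import subprocess

# Assumes the log contains a URL with a "token=" query parameter.
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9a-zA-Z]+)")


def extract_token(log_text):
    """Return the last login token found in the log text, or None."""
    matches = TOKEN_PATTERN.findall(log_text)
    return matches[-1] if matches else None


def jupyter_token(container="pydata_jupyter_1"):
    """Read the container logs with `docker logs` and extract the token."""
    result = subprocess.run(
        ["docker", "logs", container],
        capture_output=True,
        text=True,
        check=True,
    )
    # Startup messages may land on either stream, so search both.
    return extract_token(result.stdout + result.stderr)
```

Taking the last match means that if the container has been restarted, the most recently printed token is returned.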

 

Congrats, you did it!

What's Next:

Explore the notebook! Try removing different stop words or changing the confidence intervals. Data science is about exploring.

Stay Tuned for Part 3

Stay tuned for the 3rd and final part of this series that will walk through each cell in the notebook!

Want even more information about how to deploy DSE Docker containers? Check out this excellent blog by Kathryn Erickson: Docker Tutorial.
