Look Ma! No ETL!
Introduction
Solr and Hadoop are two big open source technologies that we have integrated into DataStax Enterprise on top of Cassandra. For those just joining us: Solr provides full-text search, and Hadoop provides a distributed file system plus the ability to process large datasets via MapReduce. In the traditional world, if you wanted to run MapReduce over some data and also search that same data, you would have to ETL the data over to your Solr cluster, with all the pitfalls of trying to keep the two clusters in sync. The beauty of DataStax Enterprise is that, with the right replication settings, you can search and run MapReduce over the same dataset with ease. In this example I'll be using a modified dataset from a survey done by The Pew Research Center about Facebook habits and attitudes.
Environment
This demonstration was run on my EC2 cluster: two m1.large instances running Ubuntu 12.04, with a binary install of DSE 3.0.4.
The cluster has been set up with two virtual datacenters, or DCs: an Analytics DC with a node running Hadoop, and a Solr DC with a node running Solr.
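For those curious how the two workloads were started, here is a minimal sketch, assuming a binary DSE 3.0 install on each node (paths will vary with your install location):

# On the Analytics node: start Cassandra with the Hadoop job/task trackers enabled
$ bin/dse cassandra -t

# On the Solr node: start Cassandra with search enabled
$ bin/dse cassandra -s

With the default DseSimpleSnitch, nodes are grouped into datacenters named after their workload, which is where the Analytics and Solr DC names used below come from.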
Files
To begin, we need to get the survey file: Omnibus_Dec_2012_csv
I've modified this survey file from the original by removing many of the columns. Our primary focus will be two columns, pial1a and pial4vb, which map to these two questions:
PIAL1A: As I read the following list of items, please tell me if you happen to have each one, or not. Do you have... [INSERT ITEMS IN ORDER]?
a. A handheld device made primarily for e-book reading, such as a Nook or Kindle e-reader
   1  Yes
   2  No
   8  (DO NOT READ) Don’t know
   9  (DO NOT READ) Refused

PIAL4VB: What made you decide to stop using Facebook?
Second, we need to create a Solr schema file so that DSE Solr understands how to import the data, index it, and store it in Cassandra. Copy and paste the following into a file called answers_schema.xml. This schema tells Solr how to index our documents:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="datatypes_test" version="1.0">
  <types>
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="long" class="solr.LongField" multiValued="false"/>
    <fieldType name="int" class="solr.IntField" multiValued="false"/>
  </types>
  <fields>
    <field name="psraid" type="long" indexed="true" stored="true"/>
    <field name="pial1a" type="int" indexed="true" stored="true"/>
    <field name="pial1b" type="int" indexed="true" stored="true"/>
    <field name="pial1c" type="int" indexed="true" stored="true"/>
    <field name="pial1d" type="int" indexed="true" stored="true"/>
    <field name="pial4vb" type="text" indexed="true" stored="true"/>
    <field name="pial7vb" type="text" indexed="true" stored="true"/>
  </fields>
  <defaultSearchField>pial4vb</defaultSearchField>
  <uniqueKey>psraid</uniqueKey>
</schema>
And lastly, we'll use the solrconfig.xml provided with the Wikipedia demo that ships with DataStax Enterprise:
cp dse/demos/wikipedia/solrconfig.xml .
Solr
We will first create the keyspace to store our survey data, setting the replication strategy and options so that the data will be available in both the Solr DC and the Analytics DC. By default, DSE Solr would only store data in the Solr DC.
$ cqlsh
Connected to blog at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.9.8 | CQL spec 2.0.0 | Thrift protocol 19.33.0]
Use HELP for help.
cqlsh> create KEYSPACE answers WITH strategy_class = 'NetworkTopologyStrategy' and strategy_options:Solr=1 and strategy_options:Analytics=1;
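If you want to sanity-check that both DCs will get a replica, you can describe the keyspace; a sketch of what to expect follows, though the exact output format varies by cqlsh version:

cqlsh> describe keyspace answers;

CREATE KEYSPACE answers WITH strategy_class = 'NetworkTopologyStrategy'
  AND strategy_options:Analytics = '1'
  AND strategy_options:Solr = '1';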
Now we can upload the solrconfig and answers_schema XML files to DSE Solr and create the Solr core. This process will automatically create a column family named fbsurvey in the answers keyspace, along with the columns and the appropriate metadata.
$ curl http://localhost:8983/solr/resource/answers.fbsurvey/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
SUCCESS
$ curl http://localhost:8983/solr/resource/answers.fbsurvey/schema.xml --data-binary @answers_schema.xml -H 'Content-type:text/xml; charset=utf-8'
SUCCESS
$ curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=answers.fbsurvey"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1612</int></lst>
</response>
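One side note: if you tweak the schema later, you shouldn't need to drop and recreate the core. Re-posting the file and then issuing Solr's standard RELOAD core-admin action should pick up the change; a sketch, assuming the same core name:

# Re-upload the edited schema, then reload the core so Solr picks it up
$ curl http://localhost:8983/solr/resource/answers.fbsurvey/schema.xml --data-binary @answers_schema.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=answers.fbsurvey"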
Now we can upload the survey CSV data and have Solr process it and store it back into Cassandra. We can then do a quick count of the records and spot-check that the data made it over.
$ curl http://localhost:8983/solr/answers.fbsurvey/update --data-binary @Omnibus_Dec_2012_csv.csv -H 'Content-Type:application/csv; charset=utf-8'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">2553</int></lst>
</response>

$ cqlsh
Connected to blog at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.9.8 | CQL spec 2.0.0 | Thrift protocol 19.33.0]
Use HELP for help.
cqlsh> select count(*) from answers.fbsurvey;
 count
-------
  1006

cqlsh> select * from answers.fbsurvey limit 1;
 KEY    | _docBoost | pial1a | pial1b | pial1c | pial1d | pial4vb                                                       | pial7vb
--------+-----------+--------+--------+--------+--------+---------------------------------------------------------------+---------
 201734 |       1.0 |      2 |      2 |      1 |      1 | WASNT INTERESTED. TAKING ME AWAY FROM SOCIAL LIFE AND FAMILY. |
Now we can search using Solr's HTTP API and find out how many people mentioned a COMPUTER or FAMILY in their response to why they stopped using Facebook.
The query I'm using here has some added parameters that will indent the response for readability and return only the two columns I'm interested in looking at: the id (psraid) and pial4vb, which contains the person's response.
automaton@ip-10-82-235-115:~$ curl "http://localhost:8983/solr/answers.fbsurvey/select/?q=pial4vb:(COMPUTER%20OR%20FAMILY)&indent=true&fl=psraid,pial4vb"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<result name="response" numFound="3" start="0">
  <doc>
    <long name="psraid">102113</long>
    <arr name="pial4vb">
      <str>NO COMPUTER</str>
    </arr></doc>
  <doc>
    <long name="psraid">201382</long>
    <arr name="pial4vb">
      <str>NO COMPUTER</str>
    </arr></doc>
  <doc>
    <long name="psraid">201734</long>
    <arr name="pial4vb">
      <str>WASNT INTERESTED. TAKING ME AWAY FROM SOCIAL LIFE AND FAMILY.</str>
    </arr></doc>
</result>
</response>
No computer? Ouch.
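Because the answer-code columns are indexed too, Solr's standard faceting parameters work over the same index. Here's a sketch of a request that would bucket the whole survey by e-reader ownership instead of listing documents (rows=0 suppresses the document list; I haven't shown the output):

$ curl "http://localhost:8983/solr/answers.fbsurvey/select/?q=*:*&rows=0&facet=true&facet.field=pial1a&indent=true"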
Hadoop
Now we hop over to our Hadoop node so we can run some MapReduce jobs over the data we imported via Solr. In this example we'll use Hive, whose SQL-like syntax will be familiar to many of you and makes MapReduce easy to work with. We can reference the data in Cassandra by using the keyspace name as our database and the column family name as our table, in SQL parlance. Let's see who answered yes to owning an e-reader and gave a substantial response (more than 20 characters) as to why they no longer use Facebook.
$ dse hive
Logging initialized using configuration in file:/home/automaton/dse/resources/hive/conf/hive-log4j.properties
Hive history file=/tmp/automaton/hive_job_log_automaton_201307161535_1235802152.txt
hive> use answers;
hive> select row_key from fbsurvey where pial1a=1 and length(pial4vb) > 20;
...
Ended Job = job_201307151605_0036
MapReduce Jobs Launched:
Job 0: Map: 3   Cumulative CPU: 3.23 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 230 msec
OK
101596
100454
102582
100223
100822
100161
200933
101334
200495
100032
200694
Time taken: 22.862 seconds
hive>
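Since Hive treats fbsurvey like any other table, aggregations follow naturally. For instance, a sketch of a query that would break down e-reader ownership across all 1006 respondents (answer codes as in the survey key above; output omitted):

hive> select pial1a, count(*) from fbsurvey group by pial1a;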
Summary
This example is just the tip of the iceberg of what you can do with Cassandra, Solr, and Hadoop; with DataStax Enterprise, your data can be used however you see fit, without having to wait for or worry about ETL. I glossed over a lot of concepts about how Hadoop and Solr tie into Cassandra in this demonstration, but if you want to know more, continue on to the additional reading. If you want to try DataStax Enterprise yourself, you can download it here.
Additional Reading
DataStax Enterprise Hadoop
DataStax Enterprise Search