TechnologyMarch 11, 2016

Things you didn’t think you could do with DSE Search and CQL

Things you didn’t think you could do with DSE Search and CQL

Intro

CQL and DSE Search promise to make access to a lucene backed index scalable, highly avaliable, operationally simple, and user friendly.

There have been a couple of developments in DSE 4.8 point releases that may have gone unnoticed by the community of DSE Search users.

One of the main benefits of using DSE Search is that you are able to query the search indexes from through CQL directly from your favorite DataStax driver. Avoiding the solr HTTP API all together means that you:

1) Don't need to two sets of DAO's in your app and have application logic around which to use for what purpose

2) You don't need a load balancer in front of Solr/Tomcat because the DataStax drivers are cluster aware and will load balance for you

3) You don't need to worry about one node going down under your load balancer and having a fraction of your queries failing upon node failure

4) When security is enabled, requests through the HTTP API are significantly slower, to quote the DSE docs:

"Due to the stateless nature of HTTP Basic Authentication, this can have a significant performance impact as the authentication process must be executed on each HTTP request."

Why ever use the HTTP API?

The CQL interface is designed to return rows and columns, so features like Solr's numFound and faceting, were not built in the first few releases.

These features have snuck in via patches in point releases and users that aren't studiously reading the release notes may not have noticed the changes.

How would I go about getting numfound and performing facet queries in the latest (DSE 4.8.1+) version of DSE?

Show me how

If you know you just need the count (and not the data that comes along with it) then you can just specify count(*) and keep the solr_query where clause. DSE intercepts the query and brings back numDocs from DSE Search instead of actually performing the count in cassandra:

SELECT count(*) FROM test.pymt WHERE solr_query = '{"q":"countryoftravel:\"United States\""}' ;

 count
-------
 39709 

Here it is with tracing enabled, notice that even my wide open count(*) query comes back in micros

cqlsh> SELECT count(*) FROM test.pymt WHERE solr_query = '{"q":"*:*"}' ;

 count
--------
 817000

(1 rows)

Tracing session: 7020df80-e7a9-11e5-9c31-37116dd067c6

 activity                                                                                        | timestamp                  | source    | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                              Execute CQL3 query | 2016-03-11 11:51:02.136000 | 127.0.0.1 |              0
 Parsing SELECT count(*) FROM test.pymt WHERE solr_query = '{"q":"*:*"}' ; [SharedPool-Worker-1] | 2016-03-11 11:51:02.136000 | 127.0.0.1 |             34
                                                       Preparing statement [SharedPool-Worker-1] | 2016-03-11 11:51:02.136000 | 127.0.0.1 |             84
                                                                                Request complete | 2016-03-11 11:51:02.146918 | 127.0.0.1 |          10918

The same goes for facet queries. Note that because of the way the cql protocol is designed (around rows and columns), DSE returns the facet results inside a single cell in JSON format. Pretty slick:

select * FROM test.pymt WHERE solr_query='{"facet":{"pivot":"physicianprimarytype"},"q":"*:*"}' ;  
 facet_pivot
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {"physicianprimarytype":[{"field":"physicianprimarytype","value":"doctor","count":813638},{"field":"physicianprimarytype","value":"medical","count":720967},{"field":"physicianprimarytype","value":"of","count":92671},{"field":"physicianprimarytype","value":"osteopathy","count":60123},{"field":"physicianprimarytype","value":"dentistry","count":17132},{"field":"physicianprimarytype","value":"optometry","count":11447},{"field":"physicianprimarytype","value":"medicine","count":3969},{"field":"physicianprimarytype","value":"podiatric","count":3969},{"field":""physicianprimarytype","value":"chiropractor","count":192}]}

TL;DR

You don't have to use the HTTP API for seach queries, even if you need numFound and faceting. It is now supported via CQL and solr_query.

Futures

Remember I mentioned that the cql protocol is desinged round rows and columns? Well check out this ticket resolved in C* 2.2.0 beta 1 CASSANDRA-8553. If you use your imagination, there are some improvements that can be made once DSE gets c* 3.0 under the hood to make Search functionality even more slick.

Stay tuned!

More Features!

I meant to stop here but when I asked folks (char0) to review this post, they had some additional DSE Search features that get overlooked. I'll breifly describe them and link to documentation. If you're new to DSE Search definitely read on:

Partiton routing:

Partition routing is a great multi-tennant feature that lets you limit the amount of fan out that a search query will take under the hood. Essentially, you're able to specify a Cassandra partition that you are interested in limiting your search to. This will limit the number of nodes that DSE Search requires to fullfil your request.

JSON queries

If you're looking to do advanced queries thorugh cql (beyond just a simple search) check out the datastax documentation for json queries.

timeAllowed

Many search use cases don't actually require the backend to scan the entire dataset. If you're just trying to fill out a page with search results, and latency matters more than having a complete results set (when you dont care about numFound), the timeAllowed parameter let's you set a maximum latency and DSE Search will just return the results it has found so far.

Please comment if you have any additional DSE Search Features that you think are overlooked!

Sebastian Estevez

#Datastax #Cassandra I like music, technology, and building things @syllogistic on twitter

NYC http://sestevez.com

Discover more
DSE Search
Share

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.