Apache Cassandra® ALLOW FILTERING: What It Is and When It’s Useful
Check out our Cassandra Indexing page to learn more about how you can get around the ALLOW FILTERING problem with Storage Attached Indexes (SAI) and try a hands-on exercise.
It might not be clear when to use Cassandra ALLOW FILTERING for some CQL queries and not for others.
What is ALLOW FILTERING?
ALLOW FILTERING is a feature in Cassandra Query Language (CQL) that allows you to execute a query despite the potential for unpredictable performance. When a query is executed without ALLOW FILTERING, Cassandra may reject it if it determines that the query may lead to inefficient execution.
By appending ALLOW FILTERING to a query, you can override this rule and allow the query to be executed. However, this should be used with caution, as it can lead to performance issues.
Why ALLOW FILTERING?
Let’s look at the following table:
- 1
- 2
- 3
- 4
- 5
- 6
CREATE TABLE blogs (blogId int, time1 int, time2 int, author text, content text, PRIMARY KEY(blogId, time1, time2));
If you execute the following query:
SELECT * FROM blogs;
Cassandra will return you all the data that the table blogs contains. If you now want only the data at a specified time1, you will naturally add an equal condition on the column time1:
SELECT * FROM blogs WHERE time1 = 1418306451235;
In response, you will receive the following error message:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.
Cassandra knows that it might not be able to execute the query in an efficient way. It is therefore warning you: “Be careful. Executing this query as such might not be a good idea as it can use a lot of your computing resources.”
The only way Cassandra can execute this query is by retrieving all the rows from the table blogs and then by filtering out the ones which do not have the requested value for the time1 column.
If your table contains for example a 1 million rows and 95% of them have the requested value for the time1 column, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value for the time1 column, your query is extremely inefficient. Cassandra will load 999, 998 rows for nothing. If the query is often used, it is probably better to add an index on the time1 column.
Unfortunately, Cassandra has no way to differentiate between the 2 cases above as they are depending on the data distribution of the table. Cassandra is therefore warning you and relying on you to make the good choice.
Understanding partition keys and primary keys
In Cassandra, the primary key is a fundamental concept in the table definition. It can be a single attribute or a combination of multiple attributes. The primary key is crucial because it determines the partition key, which is used to distribute data across different nodes in the cluster. The partition key ensures that all the rows with the same key are stored together, making data retrieval efficient. Additionally, the clustering key, which is part of the primary key, helps in sorting and searching for rows within a partition. Understanding the roles of primary and partition keys is essential for designing an efficient data model in Cassandra.
Secondary index and ALLOW FILTERING
If we add an index on the author column and execute the following query:
SELECT * FROM blogs WHERE author = ‘Jonathan Ellis’;
Cassandra will return all the blogs that have been written by Jonathan and will not request ALLOW FILTERING. This is due to the fact that Cassandra can use the secondary index on the author column to find the matching rows and does not need to perform any filtering.
But if we execute the following one:
SELECT * FROM blogs WHERE author=’Jonathan Ellis’ and time2 = 1418306451235;
Cassandra will request ALLOW FILTERING as it will have to first find and load the rows containing Jonathan as author, and then to filter out the ones which do not have a time2 column equal to the specified value.
Adding an index on time2 might improve the query performance, especially when used alongside the clustering column in the WHERE clause.
Cassandra will then use the index with the highest selectivity to find the rows that need to be loaded. It will however not change anything regarding the need for ALLOW FILTERING, as it will still have to filter the loaded rows using the remaining predicate.
Making the right choice for query performance
When your query is rejected by Cassandra because it needs filtering, you should resist the urge to just add ALLOW FILTERING to it. You should think about your data, your model and what you are trying to do. You always have multiple options.
You can change your data model, add an index, use another table or use ALLOW FILTERING. For example, you can use the CREATE TABLE users command to define a table structure that optimizes query performance.
You have to make the right choice for your specific use case.
Performance considerations
Using ALLOW FILTERING can lead to unpredictable performance because it requires Cassandra to load all records and then filter them based on your query criteria. This process can be expensive and resource-intensive, especially for large datasets. To avoid these performance pitfalls, consider designing your partition key to align with your query patterns or creating indexes on columns frequently used in the WHERE clause. Remember, the size of the result set is not directly related to the amount of data scanned when using secondary indexes with ALLOW FILTERING. Always weigh the performance implications before opting to use ALLOW FILTERING.
Best practices for using ALLOW FILTERING
When Cassandra rejects a query because it requires filtering, resist the temptation to simply add ALLOW FILTERING. Instead, take a step back and evaluate your data model, data distribution, and query performance. You have several options to address the issue: you can redesign your data model, add an index, use a different table, or, as a last resort, use ALLOW FILTERING. Each option has its trade-offs, so choose the one that best fits your specific use case. Thoughtful consideration of these factors will help you maintain optimal query performance and resource efficiency.