TechnologyJanuary 16, 2020

Is Running Count(*) Expensive on Cassandra?

Is Running Count(*) Expensive on Cassandra?

Count(*) is a function that returns the number of rows in a table. Running count(*) is expensive in any database, because a full table scan is required. Because Cassandra is a distributed operational database, the tables are spread across all the nodes in the cluster. A count(*) will need to scan all the SSTables on all the nodes in the cluster. A developer or DBA will run count(*) after a batch load to validate the number of records loaded or to find the number of rows in a table.

When you’re running the function in Apache Cassandra®, here are a few things to understand:

  • Cassandra is a distributed database and it goes by the CAP (consistency, availability, and partition tolerance) theorem—and not ACID (atomicity, consistency, isolation, and durability). 
  • The core Cassandra is designed for high ingestion rate and low millisecond response.
  • Cassandra spreads the data across all the nodes in the cluster, and running count(*) results in loading all the nodes in the Cassandra cluster, which we don’t recommend in production as it impacts the write and read latency for your operational workload. 
  • The Cassandra Query Language (CQL) has a default timeout of 10 seconds to get the response back from the cluster.  

The good news is that there are a number of other ways you can count the number of rows in a table that don’t require running the count(*) function.

Here are some of your options:

  • DataStax Bulk Loader, dsbulk (recommended option)
  • DataStax Analytics (Spark), if you have enabled it in the cluster
  • DataStax Search (Solr), if you have enabled it in the cluster.
  • For small tables, you can increase the CQL timeout:
    cqlsh -u cassandra -p cassandra --request-timeout=3600
  • The COPY command:
    cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'


So there you have it. You can run the count(*) function in Cassandra just like you would in other databases. But it might be a bit pricey. Instead, use one of the options outlined above to save resources and still find the answers you’re looking for.

We've made it easy and cost effective to count your data. Give it a try and let us know what you think. 

Looking to take your Cassandra skills to the next level? Check out DataStax Academy for more.

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.