Vector Search Is Coming to Apache Cassandra
There’s no artificial intelligence without data. And when your data is scattered all over the place, you’ll spend more time managing the implementation process than focusing on what’s most important: building the application. Many of the world's most prominent applications already run on Apache Cassandra, so using that data efficiently is an increasingly important goal. AI is all about scale, and bringing vector search, a key component in using AI models, into Cassandra will help organizations cut costs, streamline their data management and get more value from their data.
This cutting-edge feature, recently outlined in a Cassandra enhancement proposal (CEP-30), is further evidence of the Cassandra community’s commitment to building reliable features fast. It’s also a testament to Cassandra’s growing appeal to AI developers and organizations grappling with massive data sets, providing them with the tools to create advanced, data-driven applications.
What is vector search?
The well-established concept of text search has been around for a long time. It involves searching for a particular keyword within documents. But important data can be found in more than just text: audio, images and video (or some combination) also contain relevant information that requires a search method. That’s where vector search comes in. It’s been in use for some time now, and it has proven to be quite valuable in various applications, especially in the AI and machine learning fields.
Vector search, also known as vector similarity search, has two parts. First, the raw data must be converted into a vector representation (an array of numbers), often called an embedding, that serves as a mathematical description. Second, the vector data needs to be stored in a way that lets developers ask, “Given one thing, what other things are similar?” It’s simple and powerful for developers but challenging to implement at scale on the server side. This is where Cassandra will really shine by consistently serving data at any scale around the world with resilience that grants peace of mind.
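To make the “what other things are similar?” question concrete, here is a minimal Python sketch of an exact, brute-force similarity search using cosine similarity. The catalog, its three-dimensional embeddings and the function names are made up for illustration; real embeddings come from a model and usually have hundreds of dimensions, and this is not how Cassandra implements the feature.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def most_similar(query, items, k=2):
    """Exact search: score every item against the query and return the k best names."""
    ranked = sorted(items.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings, invented for this example.
catalog = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "coffee maker": [0.0, 0.1, 0.9],
}

print(most_similar([1.0, 0.0, 0.0], catalog))
```

Scoring every item works for a handful of rows, but the cost grows with the size of the data set, which is why doing this at scale is the hard part.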
This isn’t meant to be a deep dive into vector search. It’s an explanation of what vector search can do for your application: it adds an entirely new dimension of useful data, which can reduce code complexity and help you get the features users want into production faster.
Real-world practical examples of vector search include:
- Content-based image retrieval, where visually similar images are identified based on their feature vectors. Using a library like img2vec, you can convert an image file into a vector of 512 numbers that can be used for similarity search.
- Recommender systems, where products or content are recommended to consumers based on similarity to items they have previously interacted with.
- Natural language processing applications, where semantic similarities between textual content can be identified and leveraged for tasks such as sentiment analysis, document clustering and topic modeling. This is typically done using tools like word2vec and can require the scale Cassandra delivers.
- Use ChatGPT? Vector search is critical for the Large Language Model (LLM) use case as it enables efficient storage and retrieval of vector embeddings, representing the distilled knowledge gained during the LLM training process. By performing similarity searches, vector search can quickly identify the most relevant embeddings corresponding to a user's prompt. This helps LLMs generate more accurate and contextually appropriate responses while also providing a form of long-term memory for the models. In essence, vector search is a vital bridge between LLMs and the vast knowledge bases on which they are trained.
What’s coming to Cassandra
The Cassandra project is on a never-ending quest to make Cassandra the ultimate powerhouse in the database universe. As previously mentioned, after you convert your data into vector embeddings, you’ll need a place to store and use them. Those capabilities are being added to Cassandra, exposed in a simple yet powerful way.
Vector data type
To support the storage of high-dimensional vectors, we’re introducing a new data type, `VECTOR<type, dimension>`. This will enable the handling and storage of Float32 embeddings, which are commonly used in AI applications. This has already resulted in discussions about adding Cassandra support to AI libraries like LangChain. In this example, imagine that item_vector holds a vector created from the description to enable a semantic similarity search.
CREATE TABLE product (
id UUID PRIMARY KEY,
name varchar,
description varchar,
item_vector VECTOR<float, 3>
);
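As a rough illustration of what a fixed-dimension Float32 vector means in practice, the Python sketch below packs three values into 32-bit floats and unpacks them again. The helper names and the big-endian byte layout are assumptions chosen for the example; this is not Cassandra's actual serialization format.

```python
import struct

DIMENSION = 3  # mirrors VECTOR<float, 3> in the table above

def pack_float32_vector(values, dimension=DIMENSION):
    """Encode a fixed-dimension vector as big-endian 32-bit floats."""
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(values)}")
    return struct.pack(f">{dimension}f", *values)

def unpack_float32_vector(data, dimension=DIMENSION):
    """Decode bytes produced by pack_float32_vector back into a list of floats."""
    return list(struct.unpack(f">{dimension}f", data))

encoded = pack_float32_vector([3.4, 7.8, 9.1])
print(len(encoded))                    # 3 values x 4 bytes each
print(unpack_float32_vector(encoded))  # values come back rounded to float32 precision
```

Two things stand out: the dimension is part of the type, so a vector with the wrong number of values is rejected, and each value costs four bytes, so storage grows linearly with the dimension.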
ANN search index
We will add a new storage-attached index (SAI) called “VectorMemtableIndex,” which will accommodate the approximate nearest neighbor (ANN) search functionality. This index will work in conjunction with the new data type and Apache Lucene's Hierarchical Navigable Small World (HNSW) library to enable efficient vector search capabilities within Cassandra.
CREATE CUSTOM INDEX item_ann_index ON product(item_vector)
USING 'VectorMemtableIndex';
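For a feel of why a graph-based index can answer nearest-neighbor queries without scanning every row, here is a deliberately simplified Python sketch of a greedy walk over a neighbor graph. It is a single-layer toy, not HNSW itself (which adds a hierarchy of layers and a candidate list) and not Lucene's implementation; the function names and data points are invented for illustration.

```python
import math

def build_neighbor_graph(points, m=2):
    """Link each point to its m nearest points (brute force, fine for a toy data set)."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(points, graph, query, entry=0):
    """Hop to whichever neighbor is closest to the query until no neighbor improves."""
    current = entry
    current_dist = math.dist(points[current], query)
    while True:
        closest = min(graph[current], key=lambda j: math.dist(points[j], query))
        closest_dist = math.dist(points[closest], query)
        if closest_dist >= current_dist:
            return current  # local best: may not be the true nearest neighbor
        current, current_dist = closest, closest_dist

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
graph = build_neighbor_graph(points)
print(greedy_search(points, graph, query=(4.2, 0.0)))  # index of the point the walk ends on
```

The walk only measures distances to the neighbors of the points it visits, and it stops at the first point with no closer neighbor. That shortcut is what makes the search fast, and it is also why the result is approximate rather than guaranteed.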
ANN operator in CQL
To make it easier for users to perform ANN searches on their data, we will introduce a new Cassandra Query Language (CQL) operator, ANN OF. This operator will allow users to efficiently perform ANN searches on their data with a simple and familiar query syntax. Continuing the example, developers can ask the database for something similar to a vector created from a description.
SELECT * FROM product WHERE item_vector ANN OF [3.4, 7.8, 9.1];
Highlighting Cassandra’s extensibility
When Cassandra 4.0 was released, one of the easily overlooked highlights was the concept of increased pluggability. The new vector search functionality in Cassandra is built as an extension to the existing SAI framework, avoiding a rewrite of the core indexing engine. It uses the well-known and widely used HNSW functionality in Lucene, which provides a fast and efficient solution for finding approximate nearest neighbors in high-dimensional space.
This new addition highlights Cassandra’s modularity and extensibility. By integrating Lucene’s HNSW implementation and extending the SAI framework, the project can deliver production-ready features to developers much faster. Developers have access to numerous vector databases, and many of those started by building a vector indexing engine and added storage afterward. Cassandra has successfully tackled the challenging issue of data storage at scale for over a decade. We are highly confident that adding vector search to Cassandra will provide even more production-ready features.
New use cases
Cassandra isn’t new to machine learning and AI workloads. Long-time Cassandra users have been using Cassandra as a fast and efficient feature store for years. It’s even rumored that OpenAI uses Cassandra heavily in the building of LLMs. These use cases all employ Cassandra’s existing functionality. There will be many ways to use the new vector search. It will be exciting to see what our community comes up with, but the use cases will likely fit into two categories:
Enhance an existing use case with ANN search
If you already have an application built on Cassandra, you can enhance its capabilities by incorporating ANN (“approximate nearest neighbor”) search. For instance, if you have a content recommendation system, you can use ANN search to find similar items and improve the relevance of your recommendations. Product catalogs can denormalize features into embedded vectors stored in the same record. Fraud detection can be further enhanced by mapping behaviors to features. Think of a use case and it is probably relevant.
Build something new that needs vector search
If you’re starting a new project that requires fast similarity search capabilities, Cassandra's new vector search feature will be an excellent choice for data storage and retrieval. Knowing you can go from gigabytes to petabytes on the same system will let you focus on building your application rather than worrying about tradeoffs. In addition to storing vector embeddings, you’ll have the full power of CQL and the tabular storage of a full-featured database.
However you consume Cassandra, these options will all be available. Whether it’s your own deployment using open source Cassandra, a deployment in Kubernetes using K8ssandra or a cloud service like DataStax Astra DB, you’ll get the same great system. The freedom you get with open source is the freedom to choose how you build your applications.
Built by and for developers
As we continue to innovate and expand the capabilities of Cassandra, we remain committed to staying at the forefront of what you need in data management. The introduction of vector search is an exciting new use case that will make your data-driven applications even more powerful and versatile. This, with some of the other cutting-edge features like distributed ACID transactions at scale, will make Cassandra 5.0 the most significant upgrade you can make. We aren’t stopping here, either. The companies and developers that support Cassandra are hard at work thinking up more ways to consolidate your data, simplify management and save money.
We're confident that this addition will help not only AI developers but also organizations managing large data sets that can benefit from fast similarity search. So keep an eye out for the alpha release of Cassandra with vector search functionality, slated for sometime in Q3. We look forward to seeing the fantastic applications you'll build with this new feature, and we’d love it if you shared your use cases with the community at Planet Cassandra.