DataStax Integrates NVIDIA NIM for Deploying AI Models
DataStax is broadening retrieval-augmented generation (RAG) use cases by integrating NVIDIA NIM and NVIDIA NeMo Retriever microservices into its products to deliver high-performance RAG solutions with fast embeddings. NVIDIA microservices, part of the NVIDIA AI Enterprise software platform, give developers a set of easy-to-use building blocks that accelerate the deployment of generative AI across enterprises.
End users of generative AI applications expect real-time responses. But the complexity of accessing structured and unstructured data can introduce latency that impacts the user experience—and up to 40% of that latency can come from calls to the embedding service and vector search service.
That’s why DataStax is integrating NVIDIA microservices into DataStax Astra DB to deliver high-performance RAG data solutions.
The challenge
To make large language models (LLMs) practical for enterprise users, RAG combines pre-trained language models with a retrieval system that lets enterprises talk to their own data. RAG is incredibly useful because it reduces hallucinations, helps LLMs give more specific answers by grounding them in enterprise data, and can be refreshed as quickly as new information becomes available. It’s also a more resource-efficient approach: retrieval-time inference costs substantially less than batch-training custom models.
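The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration only: `embed`, `vector_search`, and `generate` are stand-in stubs for an embedding service, a vector database, and an LLM, not real DataStax or NVIDIA APIs.

```python
# Minimal RAG flow sketch; all three components below are illustrative stubs.

def embed(text: str) -> list[float]:
    # Stub: a real system would call an embedding model (e.g. a NIM endpoint).
    return [float(ord(c) % 7) for c in text[:4]]

DOCS = {
    "doc1": "Refund requests are processed within 5 business days.",
    "doc2": "Premium support is available 24/7 for enterprise plans.",
}

def vector_search(query_vec: list[float], k: int = 1) -> list[str]:
    # Stub: a real system would query a vector database such as Astra DB.
    def dist(doc_id: str) -> float:
        doc_vec = embed(DOCS[doc_id])
        return sum((a - b) ** 2 for a, b in zip(query_vec, doc_vec))
    return sorted(DOCS, key=dist)[:k]

def generate(prompt: str) -> str:
    # Stub: a real system would call an LLM with the grounded prompt.
    return "ANSWER based on: " + prompt

def rag_answer(question: str) -> str:
    # 1) embed the question, 2) retrieve grounding context, 3) prompt the LLM.
    context = "\n".join(DOCS[d] for d in vector_search(embed(question)))
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```

Because the model only sees retrieved enterprise documents at query time, refreshing the knowledge base is just a document insert, with no retraining step.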
The enterprise corpus of unstructured data - from software logs to customer chat history - is a gold mine of valuable domain knowledge and real-time information for generative AI applications. In customer RAG pipelines, calls to the embedding service represent 10-40% of total latency. However, many companies face the daunting technological and cost challenge of vectorizing their existing and newly added unstructured data for LLM inference. The challenge is compounded by the need to generate embeddings in near real time and to index the information in a vector database.
DataStax is working with NVIDIA to solve this problem. The NVIDIA NeMo Retriever embedding microservice generates over 1,200 embeddings per second per GPU at double-digit millisecond latencies, pairing well with a highly scalable NoSQL datastore like Astra DB, which can ingest new embeddings at 4,000+ transactions per second at single-digit millisecond latencies on low-cost commodity storage. Together, DataStax and NIM deliver ~11.9 ms combined embedding-and-indexing latency at 4,000+ ops/second, with commensurately lower operational costs.
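The throughput figures above imply a simple capacity pairing: using only the numbers quoted in the text (1,200 embeddings/second per GPU, 4,000+ inserts/second into the database), a back-of-the-envelope calculation shows how many embedding GPUs one ingest stream can absorb before the database becomes the bottleneck.

```python
import math

EMBEDDINGS_PER_SEC_PER_GPU = 1_200  # NeMo Retriever throughput (from the text)
DB_INGEST_TPS = 4_000               # Astra DB ingest rate (from the text)

# How many embedding GPUs saturate a single Astra DB ingest stream?
gpus_to_saturate_db = DB_INGEST_TPS / EMBEDDINGS_PER_SEC_PER_GPU

print(f"{gpus_to_saturate_db:.2f}")        # ≈ 3.33 GPUs
print(math.ceil(gpus_to_saturate_db))      # 4 GPUs fully cover the ingest rate
```

In other words, roughly three to four embedding GPUs keep pace with one database ingest stream, so neither side of the pipeline sits idle.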
NVIDIA NeMo Retriever is a collection of generative AI microservices enabling organizations to seamlessly connect custom models to diverse business data and deliver highly accurate responses. It enhances generative AI applications with RAG capabilities that can be connected to business data wherever it resides.
Why it matters: Skypoint’s use case
Tisson Mathew, CEO and founder of healthcare data solutions provider Skypoint, noted the importance of speed when it comes to providing a great experience for customers.
“At Skypoint, we have a strict SLA of five seconds to generate responses for our frontline healthcare providers,” Mathew said. “Hitting this SLA is especially difficult in the scenario that there are multiple LLM and vector search queries. Being able to shave off time from generating embeddings is of vast importance to improving the user experience.”
Benchmarks
To understand the impact of performant RAG solutions, we benchmarked the NVIDIA NeMo NV-Embed-QA vector embedding model (for generating vector embeddings) and the DataStax Astra DB vector database (for storing and managing vectors at scale). We ran the test harness (the open source NoSQLBench) on the same machine as the model, deployed in Docker containers. The performance tests measured the following four key metrics:
- Embedding Latency: Time to generate an embedding
- Indexing / Query Latency: Time to store / query the generated embedding
- Overall Throughput: Number of processed inputs through the system per second
- Cost: Hardware and software cost to process tokens
We ran the benchmarks on a single NVIDIA A100 Tensor Core GPU, increasing throughput from ~181 requests/second to almost 400 ops/second. Tuning NVIDIA TensorRT software - included with NVIDIA AI Enterprise for production-grade AI - with a tokenization/preprocessing model improved performance by another 25%. We then switched to the latest NVIDIA H100 80GB Tensor Core GPU (a single a3-highgpu-8g instance running on Google Cloud), which doubled throughput to 800+ ops/second.
We also looked at configurations that lowered latency and found that we can achieve ~365 ops/second at a ~11.9 ms average embedding-plus-indexing time - 19x faster than popular cloud embedding models paired with vector databases.
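To make the 19x claim concrete, the quoted figures imply a baseline of roughly 226 ms per embed-plus-index operation for the comparison stacks (the text does not name them, so this is just the arithmetic implied by the numbers given):

```python
nim_astra_ms = 11.9   # combined embedding + indexing latency (from the text)
speedup = 19          # reported speedup over popular cloud embedding stacks

# The latency the 19x speedup implies for the baseline stacks.
implied_baseline_ms = nim_astra_ms * speedup
print(round(implied_baseline_ms, 1))  # 226.1 ms
```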
When combined with NVIDIA NIM and NeMo Retriever microservices, Astra DB and DataStax Enterprise (DataStax’s on-premises solution) provide a fast vector DB RAG solution that’s built on a scalable NoSQL database that can run on any storage medium.
We’ve also launched a new feature, Vectorize, in developer preview. Vectorize performs embedding generation at the database tier: instead of customers managing their own microservices for generating embeddings, Astra DB runs its own microservice instance and passes the cost savings directly to customers.
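Conceptually, database-tier embedding means the client sends raw text and the database produces the vector itself. The sketch below simulates that contract with an in-memory store; the `$vectorize`/`$vector` field names follow Astra DB's Data API convention, but the store and embedder here are illustrative stubs, not the real service.

```python
def stub_embed(text: str) -> list[float]:
    # Stand-in for the database-hosted embedding microservice.
    return [len(text) / 10.0, text.count(" ") / 10.0]

class VectorizeStore:
    """Toy store simulating embedding generation at the database tier."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def insert_one(self, doc: dict) -> None:
        doc = dict(doc)  # don't mutate the caller's document
        text = doc.pop("$vectorize", None)
        if text is not None:
            # Embedding happens inside the "database", not in the client.
            doc["$vector"] = stub_embed(text)
        self.rows.append(doc)

store = VectorizeStore()
store.insert_one({"_id": 1, "$vectorize": "patient discharge summary"})
```

The client never handles a vector at all, which is the point: the application ships text, and the embedding infrastructure becomes the database's concern.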
Today, the DataStax vector database uses DiskANN to process 4,000 inserts per second and make them immediately searchable. While this keeps the data fresh, the tradeoff is reduced efficiency. We are working with NVIDIA to accelerate our vector search algorithms by integrating RAPIDS cuVS to improve efficiency while maintaining data freshness.
To get started, sign up for access to DataStax Astra DB and NVIDIA NIM and NeMo Retriever microservices. Then use the NVIDIA microservices / DataStax template to experience fast vector search at scale for yourself. Finally, check out the testing dashboard to identify which RAG use cases you would like to implement with NVIDIA NIM and DataStax.