Technology | December 20, 2024

How to Create a Local LangChain Vector Database

Vector databases are the standard way to store the data that adds valuable context to your generative AI application prompts. Combined with LangChain, a popular GenAI application programming framework, they help you build more accurate AI-powered applications in less time.

Most dev teams run their vector databases in the cloud, where it’s easy and convenient to spin up new instances when needed. However, this can lead to cost overruns. In some cases, using a local vector database can both save money and simplify setting up a local dev stack.

In this article, we’ll look at the benefits of using a local vector database with LangChain, discuss the best options, and explain how to migrate from a local instance to a hosted one seamlessly.

Using vector databases with LangChain 

LangChain is a framework available for both Python and JavaScript that simplifies Gen AI application development. It supports composing calls to large language models (LLMs) with other AI app components in a specified order using a simple programming syntax.

LLMs are great at answering general questions with human-language responses. However, their training data is usually months or years out of date. Plus, the models have no access to domain-specific information - e.g., knowledge of your product catalog, access to your past support cases, etc. - that would enable them to return accurate answers about your problem domain.

To compensate for this, LangChain offers multiple components to support retrieval-augmented generation (RAG). With RAG, a GenAI app uses information from a user’s prompt to retrieve additional helpful information from another data store, such as a knowledge base. It then supplies that information to the LLM along with the prompt as context. This results in more accurate and up-to-date responses. 
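The RAG flow described above can be sketched in a few lines of Python. This is a toy illustration, not LangChain's API: the knowledge base entries are made up, and `retrieve` is a hypothetical stand-in for a vector store lookup that simply ranks entries by the words they share with the question.

```python
# Toy RAG flow: retrieve context for a question, then build the LLM prompt.
# Everything here is illustrative; a real app would query a vector store.

KNOWLEDGE_BASE = [
    "Orders ship within two business days.",
    "Returns are accepted within 30 days of delivery.",
    "Support is available by chat from 9am to 5pm.",
]


def _words(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(question, k=1):
    """Hypothetical retriever: rank entries by words shared with the question."""
    q_words = _words(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & _words(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question):
    """Combine the retrieved context and the user's question into one prompt."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("When are returns accepted?"))
```

The string returned by `build_prompt` is what would be sent to the LLM. In a real application, the retrieval step would be a similarity search against a vector database rather than word overlap.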

Why use a local LangChain vector database? 

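The data storage platform of choice for RAG is a vector database. A vector database stores data as vector embeddings: numerical representations, produced by an embedding model, that capture the meaning of the original content. It supports searching for approximate matches in a multi-dimensional vector space by returning the stored vectors closest to the query vector under a similarity measure such as cosine similarity.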
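To make the idea of searching a vector space concrete, here is a minimal sketch using only the Python standard library. The three-dimensional "embeddings" are made up for illustration (real embeddings typically have hundreds or thousands of dimensions), and the search is an exact brute-force scan using cosine similarity, one common similarity measure. Vector databases generally use approximate nearest-neighbor indexes to get similar results much faster at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean 'very similar'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings keyed by the text they represent (illustrative only)
STORE = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.2, 0.9],
}


def nearest(query, k=2):
    """Return the k stored items whose vectors are most similar to the query."""
    ranked = sorted(
        STORE,
        key=lambda name: cosine_similarity(query, STORE[name]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(nearest([1.0, 0.0, 0.0]))
```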

Using a vector database with LangChain for RAG is easy. LangChain supports document loaders, text splitters, and embeddings generators for indexing and storing information as vectors. It also supports retrievers for querying vector stores and embedding the results in an LLM prompt. 
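As a rough illustration of what a text splitter does during indexing, the sketch below cuts a document into fixed-size chunks that overlap slightly, so that a sentence straddling a boundary still appears intact in at least one chunk. This is not LangChain's splitter implementation; the function name, chunk size, and overlap are assumptions chosen for the example.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters. Illustrative only.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the final chunk is never just a repeat
    # of the tail of the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


if __name__ == "__main__":
    for chunk in split_text("abcdefghij" * 7):
        print(repr(chunk))
```

Each chunk would then be passed to an embedding model and stored in the vector database alongside its vector.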

Most teams, when they go to production, will likely use a vector database hosted by a cloud service provider, such as Amazon Web Services (AWS), or a serverless vector database platform like Astra DB. So, it makes sense that developers would also spin up their own cloud vector database instances for their local dev environments.

However, unless rigorously managed, this approach can get very expensive, very quickly. Developers may create DB instances much larger than they need for dev. You also have to deal with people leaving instances running and chewing up cloud spend needlessly. 

This makes spinning up a database instance on local developer machines an attractive alternative. Local vector database instances mean: 

 

  • No cloud expense, as each dev only uses their local dev machine resources
  • Less potential waste to track and manage in your cloud accounts
  • Faster performance and less overhead spent debugging connectivity issues with a cloud instance
  • Easy migration - you can usually transfer from a locally-hosted database to a cloud-hosted one just by changing a connection string in your app config 
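One way to keep that migration to a configuration change is to read connection settings from the environment and fall back to local defaults. The sketch below assumes hypothetical variable names (`VECTOR_DB_HOSTS`, `VECTOR_DB_PORT`, `VECTOR_DB_KEYSPACE`); none of them are defined by LangChain or Cassandra.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorDBConfig:
    contact_points: tuple
    port: int
    keyspace: str


def load_config(env=None):
    """Build connection settings from environment variables.

    The variable names are hypothetical. With nothing set, the config
    points at a database running on the local machine.
    """
    env = os.environ if env is None else env
    hosts = env.get("VECTOR_DB_HOSTS", "127.0.0.1")
    return VectorDBConfig(
        contact_points=tuple(h.strip() for h in hosts.split(",") if h.strip()),
        port=int(env.get("VECTOR_DB_PORT", "9042")),
        keyspace=env.get("VECTOR_DB_KEYSPACE", "dev"),
    )


if __name__ == "__main__":
    print(load_config())
```

With this pattern, application code stays the same across environments. Only the values supplied by the deployment environment change.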

Spinning up a local LangChain vector database

You can install a local vector database using a package installer (Yum, Apt) or an executable installer. However, the easiest way to start up a local LangChain vector database is using a Docker container. 

Docker runs containers: isolated environments that package an application together with all the dependencies it needs. Containers are easy to customize and fast to start up. If you wreck your environment, you can easily toss the current container and start from scratch.

To get started, you just need to install Docker for Linux/Mac or Windows. Then, download and run a prebuilt Docker image that contains your vector database of choice. 

To simplify development even further, you can build your own Docker image based on the base install that contains pre-loaded data for development. You can load data directly onto the Docker image itself or mount a volume on your local machine using a bind mount.

Needless to say, you need to select a vector datastore that works with LangChain. Fortunately, LangChain supports most popular vector databases to varying degrees, as well as a number of data stores with vector indexing and retrieval capabilities. Here are the ones we recommend using for local vector database development. 

Cassandra 5.0 

Apache Cassandra® is a popular NoSQL database that can handle petabyte-scale data with high availability, performance, and resiliency. As of version 5.0, it supports creating a vector search table, creating a vector index, loading embeddings, and running queries with Cassandra Query Language (CQL), a query language with SQL-like syntax.
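For a sense of what those CQL statements look like, the sketch below builds them as Python strings. The keyspace, table, and column names are hypothetical, the keyspace is assumed to exist already, and the statements follow the Cassandra 5.0 vector search syntax (a `vector<float, N>` column, a storage-attached index, and an `ANN OF` query). You could run the resulting strings in cqlsh or through a Cassandra driver.

```python
def vector_table_cql(keyspace, table, dimension):
    """Return CQL to create a table with a vector column and an index on it."""
    create_table = (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} ("
        f"id int PRIMARY KEY, body text, "
        f"embedding vector<float, {int(dimension)}>);"
    )
    create_index = (
        f"CREATE INDEX IF NOT EXISTS {table}_ann "
        f"ON {keyspace}.{table} (embedding) USING 'sai';"
    )
    return create_table, create_index


def ann_query_cql(keyspace, table, vector, limit=3):
    """Return a CQL query for the rows whose embeddings are nearest to `vector`."""
    literal = "[" + ", ".join(str(float(x)) for x in vector) + "]"
    return (
        f"SELECT id, body FROM {keyspace}.{table} "
        f"ORDER BY embedding ANN OF {literal} LIMIT {int(limit)};"
    )


if __name__ == "__main__":
    for statement in vector_table_cql("demo", "docs", 3):
        print(statement)
    print(ann_query_cql("demo", "docs", [0.1, 0.2, 0.3]))
```

The vector dimension in the table definition has to match the output size of whichever embedding model you use.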

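An official Cassandra Docker image is available on Docker Hub, so you can get started immediately. Create a Docker network first (docker network create some-network), then run this command, replacing tag with the version you want (for example, 5.0):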

docker run --name some-cassandra --network some-network -d cassandra:tag

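This pulls the Cassandra image from Docker Hub and starts a container listening on the default Cassandra ports (9042 for client communication, plus others such as 7000 for inter-node communication in a cluster and 7199 for JMX monitoring). Note that this command doesn’t publish those ports to your host machine; add -p 9042:9042 if your app runs outside the Docker network.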

If you want to run a cluster of Cassandra instances, you can instead run the container in a network and specify the CASSANDRA_SEEDS variable, which is a list of other instances you can use to bootstrap a new node into the cluster:

docker run --name some-cassandra2 -d --network some-network -e CASSANDRA_SEEDS=some-cassandra cassandra:tag

You can also modify the Cassandra container to use external volumes, load external data, or perform any other configuration you need. The following Docker Compose file, for example, mounts a named volume at /var/lib/cassandra, the directory where Cassandra stores its data. This ensures that data persists between container runs.

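services:
  cassandra:
    # Vector search needs Cassandra 5.0 or later; pin a version tag for reproducible builds
    image: cassandra:latest
    container_name: cassandra-container
    ports:
      # Publish the CQL client port to the host
      - "9042:9042"
    environment:
      # Note: these credential variables may be ignored by the official image; check your image's documentation
      - CASSANDRA_USER=admin
      - CASSANDRA_PASSWORD=admin
    volumes:
      # Persist Cassandra's data directory in a named volume
      - cassandra-data:/var/lib/cassandra

volumes:
  cassandra-data: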

You can then connect your LangChain apps to Cassandra using the Cassandra connector supplied by LangChain.
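Cassandra can take a while to start inside a container, so a connection attempt made right after the container launches may fail. The sketch below, which uses only the Python standard library, waits until the CQL port accepts TCP connections. The function name and defaults are illustrative, and it assumes port 9042 is published on the local machine. An open port is a reasonable readiness signal for local development, though not a guarantee that the node is fully ready to serve queries.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=9042, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_port(timeout=5.0):
        print("Cassandra port is reachable")
    else:
        print("Timed out waiting for Cassandra")
```

Calling this before creating the vector store in your app, or in a test fixture, avoids intermittent startup failures in a local dev stack.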

DataStax Enterprise (DSE) 

Another option is to use our own DataStax Enterprise (DSE) Docker container. This container includes Cassandra 5.0 built with search, analytics, and graph capabilities. 

To run the container with all of the options enabled (-s for search, -k for analytics, and -g for graph), run the following command:

docker run -e DS_LICENSE=accept --name my-dse -d datastax/dse-server:<version tag> -s -k -g

You can connect to the container using Cassandra command line tools installed on your client. Alternatively, you can run cqlsh, Cassandra’s interactive shell, inside the container:

docker exec -it <container_name> cqlsh

Challenges with local database development

While it has its upsides, developing with a local vector database also has some challenges. 

The biggest is that you need a solid transition plan to move from local dev to prod and pre-prod environments (testing, QA, etc.). The best way to handle this is by using Infrastructure as Code (IaC) to build your environments and ensure your vector database configuration is consistent across environments. 

Additionally, if developers need to operate on a large dataset, running locally may not be an attractive option, as it may require too many resources (or too much time) to load and query the data you need.

Astra DB: An alternative to local LangChain vector database development

An alternative to local development is to use a serverless vector database that provides an affordable option for developers. 

Astra DB is a serverless vector database that scales to petabytes of data. It’s built on Cassandra (DataStax is a major Cassandra contributor), which means your Cassandra CQL calls are compatible with Astra DB. That means you can transition easily from local Cassandra development or proofs of concept to using Astra DB for production-grade workloads.

Astra DB works seamlessly with LangChain via the Astra DB connector. Even better, when you sign up for Astra DB, you can access Langflow, a no-code/low-code integrated developer environment for GenAI apps that supports both Astra DB and LangChain.

It’s also easy to integrate Astra DB into your data stack. Since it’s serverless, there’s no need to manage another component of your architecture. We handle uptime, patching, and scaling for you. 

While you can still develop locally on Cassandra and transition to Astra DB later, Astra DB also offers a free tier for developers. Devs get a $25/mo. credit, which enables up to 80GB of storage and 20 million R/W operations. That means you can freely develop against Astra DB without worrying you’ll break the bank.

Try it for yourself - sign up for a free DataStax account today.
