
Building an Open Source RAG Application

Updated: March 21, 2025 · 5 min read

Large language models (LLMs) let you build applications with natural language interfaces, providing users with an intuitive experience. But LLMs don’t come packaged with every piece of information you might want for your application. That’s where retrieval-augmented generation (RAG) comes in. 

RAG is a technique where you search for the relevant information and then pass it to the LLM to generate the best response. It’s particularly useful when the LLM doesn’t have knowledge about specific, recent, or private data. Here we'll look at how RAG applications work, how to build them, and what to consider when you do. 

How RAG applications work

The key to RAG applications is providing the right context. Think of a conversation with another person: each of you is doing retrieval-augmented generation. You give them some context, and because they already know things, they can generate the best response. The better the context, the better the response.

Services like Google, Bing, or even ChatGPT with browsing access do the same thing at scale: they retrieve relevant information from the web and then provide the best response based on it. But, as with all RAG systems, the quality of the context is crucial. If you don’t have the right context, the answers won’t be as useful.

How to build a RAG application

Building a RAG application is fundamentally different from building a plain LLM interface. Instead of letting the LLM just generate a response from its training data, you have it tell you what’s in the documents it retrieves. If you were to ask an LLM about DataStax, it might give you some good answers, but for a chatbot on a website, you want an exact answer. In that case, you have the app read some documentation and then use that documentation to give a correct answer when you ask a question.
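To make that concrete, here is a minimal sketch of the retrieve-then-generate flow, assuming the OpenAI Python client. The search_docs argument is a hypothetical stand-in for whatever retrieval function your app uses, and the model name is just an example; swap in any chat-capable LLM.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_rag(question, search_docs):
    # 1. Retrieve: find documentation chunks relevant to the question.
    chunks = search_docs(question)

    # 2. Augment: put the retrieved text into the prompt as context.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the documentation below.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate: have the LLM produce an answer grounded in that context.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content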

This requires some custom rules. For example, you might make a rule that if a user refers to "database," you assume they mean Astra DB. If you ask ChatGPT about databases, it might veer off and start giving advice that isn’t relevant to DataStax. So, you build a prompt that specifies the assumptions the model should hold.

You can also set some limitations: if the question is not explicitly related to DataStax, you may prompt it to say, "I'm sorry, I only answer questions related to DataStax." Giving the LLM some instructions on what to do if it can't find the answer is important because in many scenarios ChatGPT will give a very confident but incorrect answer. So you want to set up prompt guard rails to have it say, "I don't know." 
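A simple way to encode those rules and guard rails is a system prompt. The wording below is only illustrative; tune the rules to your own product and documentation.

SYSTEM_PROMPT = """You are a support assistant for DataStax.

Rules:
- If the user says "database", assume they mean Astra DB.
- Only answer questions about DataStax and its products.
- If the question is not related to DataStax, reply:
  "I'm sorry, I only answer questions related to DataStax."
- If the answer is not in the provided documentation, reply: "I don't know."
"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Which database should I use for vector search?"},
]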

Chunking text for RAG applications

When a RAG application searches through documentation, it’s trying to match unique keywords, capture meaning, and account for synonyms. To enable that search, you need to embed your documents as vectors, which starts with deciding how you will break your text into chunks.

If your chunks of text are too big, then the meaning becomes diluted; all your text embeddings start becoming the same. They’re all similar because you have one big chunk of text over the entire file.

The smallest chunks of text have the most concentrated meaning. However, there’s a limit. The question becomes “What’s the smallest chunk of text that has the most concentrated meaning, but still has a coherent meaning?” A word is too small because, on its own, it doesn’t carry a coherent intention. A logical conclusion is to chunk by sentence. But that comes with its own challenges.
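As a rough sketch, here is what sentence-level chunking and embedding can look like, assuming the sentence-transformers library; the model name and sample text are just examples, and any embedding model follows the same pattern.

import re
from sentence_transformers import SentenceTransformer

def chunk_by_sentence(text):
    # Naive sentence splitter; a production pipeline might use nltk or spaCy instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

doc = "Astra DB is a serverless database with vector search. It integrates with popular GenAI frameworks."
chunks = chunk_by_sentence(doc)
embeddings = model.encode(chunks)  # one vector per sentence-sized chunk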

Small to large results

Imagine you have the following text in your documentation: “Patrick McCon is a software engineer and VP of development relations. He started with dSTX in 2012. Previously, he worked at Hopson as their Chief Architect.” Now imagine your application receives the query: "Where does Patrick McCon work?"

If we embedded each sentence, which one would the embedding model say is the most relevant? You might think it’s the second, since that contains exactly the information the query asks for. But it’s probably the first sentence, since “McCon” is not a very common word and it only appears there.

To fix this, you can use the window strategy. The idea is to combine relevant chunks of text with surrounding chunks. By adjusting the threshold, you can tune it to find the information it needs in response to prompts. This is the “small to large” result paradigm: find the small relevant chunks, then expand them to larger groups of surrounding information to get what you need.
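Here is a sketch of that window idea, assuming you already have a query embedding and one embedding per sentence-sized chunk (for example, from the chunking code above). For brevity it picks the single best chunk; a real implementation would also apply a relevance threshold.

import numpy as np

def small_to_large(query_vec, chunk_vecs, chunks, window=1):
    # Score every small chunk against the query with cosine similarity.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best = int(np.argmax(sims))

    # Expand the best small match to a larger window of surrounding chunks.
    start = max(0, best - window)
    end = min(len(chunks), best + window + 1)
    return " ".join(chunks[start:end])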

Other considerations for RAG applications

There are a few other factors to consider when setting up your LLMs for RAG applications: model temperature and accuracy thresholds.

Temperature

Model temperature is a parameter that controls a model’s behavior when it generates a response. When you decrease the temperature of a model, its behavior becomes less random. However, even at temperature zero, a model does not give deterministic output.

Although controlling model output is what you want with RAG, temperature zero is actually bad for response generation. When you set temperature to zero, decoding becomes a greedy search; the model always picks the single most probable next token. But that inhibits the LLM’s ability to reach completions where a slightly less probable token now leads to a sequence that is more probable overall a few tokens later.

LLMs get their smarts probabilistically, but when you set temperature to zero, you take away that probabilistic ability. LLMs sample from a distribution of likely responses, and temperature is what controls the shape of that probability distribution.
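A toy example of what temperature does to the next-token distribution (the logit values here are made up):

import numpy as np

def next_token_probs(logits, temperature):
    # Divide logits by temperature before the softmax; lower temperature sharpens
    # the distribution, and near zero it approaches greedy decoding.
    scaled = logits / max(temperature, 1e-6)
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.5, 0.5])    # made-up scores for three candidate tokens
print(next_token_probs(logits, 1.0))  # softer: probability spread across tokens
print(next_token_probs(logits, 0.1))  # sharper: almost all mass on the top token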

Accuracy thresholds

RAG applications rely on approximate nearest neighbor (ANN) search to find the embedded vectors most relevant to a query. With ANN searches, you get a relevance score for each retrieval. That relevance score is critical to generating meaningful responses in your RAG application.

How to interpret those scores is up to you. If you only want the most relevant information, you may decide to use only retrievals with a relevance score of 85 to 90% or better. If the model can’t find any data with a suitable score, you can just have it say, “I don’t know” – just like a person would.
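In code, that can be as simple as filtering retrievals by score before building the prompt. The (text, score) result format below is an assumption; adapt it to whatever your vector store returns.

def filter_by_relevance(results, threshold=0.85):
    # Keep only retrievals whose relevance score clears the bar.
    return [text for text, score in results if score >= threshold]

hits = [("Astra DB supports vector search.", 0.91), ("An unrelated page.", 0.42)]
context = filter_by_relevance(hits)
if not context:
    answer = "I don't know."  # nothing relevant enough, so don't guess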

Conclusion

RAG applications let you build language model interfaces that have access to specific information. Now that you’ve seen how to build guard rails, define text chunks, use small-to-large result generation, and set your LLM parameters, you’re ready to start building RAG apps. That’s where DataStax comes in.

DataStax is a one-stop GenAI stack to help you quickly and easily build AI applications – like a RAG system. See what DataStax can do for you by trying it out for free today.
