Technology | September 18, 2024

How Knowledge Graph RAG Boosts LLM Results

Sometimes, retrieval-augmented generation (RAG) systems don’t go deep enough into a document set to find the required answers. We might get generic or shallow responses, or responses in which the RAG system retrieves only low-level detail and then fills in the gaps with unrelated or incorrect information, known as “hallucinations.”

Deep knowledge bases and document sets may contain all the information we need to answer questions in a RAG prompt, but the RAG system might not be able to find it all, especially if the required information is spread across multiple documents and different topics or subtopics. In particular, vector retrieval will often produce a good set of documents, but some concepts within those documents require more information in order for the system to understand them, so it would be helpful to also retrieve additional documents directly related to those concepts.

Some types of data sets that are likely to have these issues:

  • Collections of documents that frequently reference each other
  • Documents with sections, term definitions and glossaries, where following the cross-references is essentially the only way to get the complete picture of a given topic
  • Large wikis or knowledge bases in which almost every paragraph contains HTML links to other pages and to external websites

Data sets like this are often found in:

  • Legal documents
  • Technical documentation
  • Research and academic publications
  • Highly interconnected websites

If your organization has deep and complex data sets of interrelated documents and other content, standard RAG implementations might not successfully address some of the most common use cases, especially when prompts ask for detailed explanations that include information at both broad and highly specific levels. Converting the implementation to graph RAG, which means augmenting the RAG system with a knowledge graph that assists with retrieval, can enable the system to delve deeper into data sets and provide detailed, correct responses to prompts that request specialized information.

Let’s explore the key concepts behind how a knowledge graph can improve performance of a RAG system, what such a graph might look like and how to start building a graph RAG system on your own data.

How does a graph help?

In a nutshell, a knowledge graph combined with a vector store of documents can provide a way to directly connect chunks of text that might not be close or similar to each other in the vector space, and thus are not inherently seen as “relevant” to one another during the retrieval process.

A typical RAG system retrieves documents (or “chunks”) from the vector store that are most relevant to the prompt according to a measure of vector similarity. If those documents contain links or references to other documents, then clearly the authors of the documents thought they were meaningfully related. And if the documents are meaningfully related, why wouldn’t we want to use that information to dig deeper and get more details that might help answer the prompt?

To restate the situation: we have documents that are clearly and directly related — via links or references — and we want to ensure that our RAG system considers those connections when retrieving documents. Building a network of linked documents results in a graph structure that we can traverse to find related documents that might not otherwise surface during typical retrieval; using this graph to augment RAG is known as graph RAG.

The main idea is that we already have an implicit, high-confidence graph relating documents to one another, via direct links and references. We want our RAG system to make full use of these known, high-certainty connections before it relies on less-certain vector similarity and relevancy scores to fill in the details of the response, which runs a higher risk of producing hallucinations.
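To make this concrete, here is a toy sketch (not any particular library’s API): assuming a vector store with a LangChain-style `similarity_search` method, documents that carry an `id` in their metadata, and a simple adjacency map of known document links, we retrieve by similarity first and then expand the results along those high-confidence edges:

def graph_rag_retrieve(query, vector_store, link_graph, k=4):
    # Step 1: standard vector retrieval.
    hits = vector_store.similarity_search(query, k=k)
    doc_ids = {doc.metadata["id"] for doc in hits}

    # Step 2: expand the result set along known, high-confidence links.
    expanded = set(doc_ids)
    for doc_id in doc_ids:
        expanded |= link_graph.get(doc_id, set())
    return expanded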

What types of connections can we use?

The possibilities for defining a graph are limitless, but we’ve found that the best and most effective types of connections for use in graph RAG are those that are well-defined and meaningful. That is, we want it to be clear what is a connection and what is not, so we tend to avoid defining connections for fuzzy concepts like general topic and sentiment. And we want the connections to be meaningful, in the sense that having a connection between two documents in the graph makes it very likely that the content in each document is relevant to the other. Below are some of the most useful ways to define connections between documents in graph RAG.

Links between documents

One of the clearest and most obvious ways to connect documents is a direct link from one to the other, in the sense of HTML links in web-based documents. From a human perspective, if we click on a link in one document and end up at another, the two documents are connected. These connections can be defined and extracted in software with any number of link extraction tools. Generally, the author of a document adds a link for a reason, so the connection it represents is meaningful. In this way, HTML links are some of the most well-defined and meaningful connections between documents that we could use in our knowledge graphs.

Building a knowledge graph from HTML links has worked very well on data sets such as technical documentation and large wikis or knowledge bases. The interconnected nature of these data sets makes graph RAG especially useful for diving into specialized details, definitions and subtopics that may not be found by vector search alone.

Some example code for extracting links from HTML documents:

from bs4 import BeautifulSoup
from ragstack_langchain.graph_store.extractors import HtmlLinkEdgeExtractor

html_link_extractor = HtmlLinkEdgeExtractor()

# Starting with an HTML document called `html` (a LangChain `Document`)
# and its source `url`. `select_content` is a helper that picks the main
# content element out of the parsed page.
soup = BeautifulSoup(html.page_content, "html.parser")
content = select_content(soup, url)

# Extract the HTML links found in the content and record them on the
# document as graph edges.
html_link_extractor.extract_one(html, content)

For an end-to-end example of graph RAG using HTML link extraction to build the graph, check out this recent piece, “Better LLM Integration and Relevancy with Content-Centric Knowledge Graphs.”

Keywords and topics

Although building a graph from connections based on general topics or sentiment can be too fuzzy and uncertain for the purposes of graph RAG, it is often possible to effectively use highly specialized keywords and topics that are well-defined and meaningful. In particular, keywords within a specialized domain can be effective for making connections between documents in graph RAG. Specialized keywords are not always captured in the vector embedding representation of documents, and therefore benefit from the stronger, more deliberate connection that a knowledge graph provides.

There are some excellent tools for extracting keywords; the following is a very simple example using KeyBERT:

from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

Extending the call to phrases of up to two words extracts specialized domain keywords:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]

How we turn these keywords into a knowledge graph depends on our use case and data model. One example can be found in the docs on knowledge graph RAG.

Building a graph with meaningful keywords as nodes connected to the documents in which they appear can be an effective graph RAG strategy. Note that to connect documents to one another via the graph, we have to traverse the graph to a depth of two or more: one step from a document to its keywords, and a second step from those keywords to other documents containing them, as sketched below.
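Here’s a rough illustration of that depth-two traversal, using KeyBERT as above and plain Python dictionaries in place of a real graph store:

from collections import defaultdict

from keybert import KeyBERT

kw_model = KeyBERT()

def build_keyword_graph(docs, top_n=5):
    # Map each extracted keyword to the set of document IDs containing it.
    keyword_to_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for keyword, _score in kw_model.extract_keywords(text, top_n=top_n):
            keyword_to_docs[keyword].add(doc_id)
    return keyword_to_docs

def related_docs(doc_id, docs, keyword_to_docs):
    # Depth-two traversal: document -> its keywords -> other documents.
    related = set()
    for keyword, _score in kw_model.extract_keywords(docs[doc_id]):
        related |= keyword_to_docs.get(keyword, set())
    related.discard(doc_id)
    return related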

Terms and definitions

In legal documents, academic publications and works of research, terms and their definitions often appear as a list or glossary, usually at the beginning or end of the document. In these cases, it’s helpful to reference those definitions throughout the document so we can always be clear about what’s being said. Without them, some parts of the documents can become vague or almost meaningless.

One particularly apt example is a large collection of contracts between tenants and landlords that we want to query with our RAG system. The documents would typically be chunked before being loaded into the data store, which means that any terms and definitions appearing at the beginning or end of the documents are not inherently included with the chunks themselves. And because there are many contracts between different tenants and landlords, any chunk that references the word “tenant” or the word “landlord” would be ambiguous without connecting it to the particular tenant and landlord in question.

In this case, it would be extremely useful to have a knowledge graph that explicitly connects document chunks with the appropriate definitions of the terms appearing in them. The specific implementation for extracting those terms and definitions, and connecting them to the correct chunks, would depend on factors such as the format of the original documents and how the glossary or definitions are structured relative to the rest of the document. Many text and document parsers are available and appropriate for this purpose, and work is being done to standardize the process with graph RAG in mind. A simple sketch of the idea follows.
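Assuming the glossary has already been parsed into a Python dict mapping each term to its definition (the parsing step itself depends on the document format), a minimal sketch could scan each chunk for defined terms and record the matches as metadata, which a graph store can then turn into chunk-to-definition edges:

import re

from langchain_core.documents import Document

def link_terms_to_chunks(chunks: list[Document], glossary: dict[str, str]) -> None:
    for chunk in chunks:
        # Find every glossary term that appears in this chunk.
        found = [
            term
            for term in glossary
            if re.search(rf"\b{re.escape(term)}\b", chunk.page_content, re.IGNORECASE)
        ]
        # Hypothetical metadata key; a graph store can translate these
        # matches into edges from the chunk to the definition nodes.
        chunk.metadata["defined_terms"] = found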

Document structure: Section references, page numbers, and more

When documents are chunked and loaded into a vector store, any document structure beyond the chunks themselves is lost unless we capture it in some way. For many RAG use cases, it would be helpful for the system to know where each document chunk sits in the overall structure of the document: its headings and subheadings, its page number, and which chunks come immediately before and after it.

Preserving this information in a knowledge graph connected to each chunk has two main advantages for the purposes of graph RAG. First, knowing where a chunk sits within the document allows us to pull in nearby text, which could be the chunks immediately before and after, text from the same page or text from the same sections — all of which could provide supporting evidence and details for the topics mentioned in the initial chunk. Second, some documents include cross-references to other section numbers, headings and page numbers, and thus it would be helpful to have a knowledge graph allowing the RAG system to directly retrieve the chunks in the sections that are referenced.
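As an illustration, the structural metadata attached to each chunk might look something like this (the field names are illustrative, not a standard schema):

from langchain_core.documents import Document

# Hypothetical structural metadata captured at ingestion time; a knowledge
# graph can use these fields to link the chunk to its neighbors, its
# sections and the sections it cross-references.
chunk = Document(
    page_content="The tenant shall provide written notice...",
    metadata={
        "headings": ["4. Obligations", "4.2 Termination"],
        "page": 12,
        "prev_chunk_id": "contract-007-chunk-041",
        "next_chunk_id": "contract-007-chunk-043",
        "section_references": ["7.1", "Appendix B"],
    },
)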

How do we build this graph to improve our RAG systems?

We lay out more technical details in this piece on content-centric knowledge graphs, where we explain how to build a knowledge graph from web-based technical documentation using `langchain`, `ragstack`, Cassandra and related tools. We build the knowledge graph from the HTML links appearing in the documents, which is one of the easiest and most useful ways to construct a graph for graph RAG.

To process an HTML document and add appropriate metadata for graph RAG, we can use a helper function such as:

from bs4 import BeautifulSoup
from langchain_core.documents import Document
from markdownify import MarkdownConverter
from ragstack_langchain.graph_store.extractors import HtmlLinkEdgeExtractor

markdown_converter = MarkdownConverter(heading_style="ATX")
html_link_extractor = HtmlLinkEdgeExtractor()

# `select_content` (a helper that picks out the main content element) and
# the `CONTENT_ID` metadata key are defined as in the piece linked above.
def convert_html(html: Document) -> Document:
    url = html.metadata["source"]
    soup = BeautifulSoup(html.page_content, "html.parser")
    content = select_content(soup, url)

    # Use the URL as the content ID.
    html.metadata[CONTENT_ID] = url

    # Extract HTML links from the content.
    html_link_extractor.extract_one(html, content)

    # Convert the content to markdown and add to metadata.
    html.page_content = markdown_converter.convert_soup(content)

    return html

And once the documents have been processed and the proper metadata has been added, they can be loaded into a graph vector store like the example below, which uses Astra DB as the underlying data store and `CassandraGraphStore` as the implementation of `GraphVectorStore`, functioning as both the knowledge graph and the vector store:

import cassio
from langchain_openai import OpenAIEmbeddings
from ragstack_langchain.graph_store import CassandraGraphStore

# Initialize AstraDB connection
cassio.init(auto=True)

# Create embeddings
embeddings = OpenAIEmbeddings()

# Create knowledge store
graph_store = CassandraGraphStore(embeddings)

...  # load and process your documents, e.g. `convert_html` above

# Add documents to knowledge store
graph_store.add_documents(docs)
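Once loaded, the graph store can serve as a retriever that combines vector search with graph traversal. The exact parameter names can vary by library version, but a traversal-depth option along these lines is typical:

# Retrieve by vector similarity, then traverse graph edges to pull in
# linked documents. `depth` controls how many link-hops to follow.
retriever = graph_store.as_retriever(search_kwargs={"depth": 1})
docs = retriever.invoke("What is graph RAG?")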

Learn more

To learn more about optimizing the construction and use of knowledge graphs for graph RAG, read the recent article “Scaling Knowledge Graphs by Eliminating Edges,” which includes an introduction to the handy `GraphVectorStore` in LangChain.

For the latest updates on how DataStax can help get you started with graph RAG, quickly and with minimal code changes, check out the work we’re doing on RAG with Vector Graph.

This post was originally published in The New Stack.
