Technology | January 17, 2025

Simplifying Ground Truth Generation for LLMs

Leveraging large language models (LLMs) in critical business processes, customer-facing agents, or compliance-driven scenarios requires accurate, contextual, and verifiable information. 

Ensuring accuracy boils down to how well the model is grounded in your organization’s unique knowledge. Key to accomplishing this is a reliable ground truth dataset: questions and validated answers that represent the correct and desired responses for a given domain.

The process of generating a ground truth dataset can be costly, complex, and labor-intensive. In this post, we’ll explain why it’s so challenging, and how a new ground truth generator toolkit can help. We’ll demonstrate how to use this automated approach, which harnesses the power of LLMs themselves and some straightforward code, to generate high-quality, domain-specific ground truth. By the end of this post, you’ll have a clear roadmap and a practical solution for creating and maintaining a robust ground truth dataset rooted in your own source documents.

What is ground truth?

Ground truth is the ultimate reference point. In traditional machine learning, ground truth labels are the benchmark for evaluating a model’s performance. In the context of LLMs, ground truth Q&A pairs help you:

  • Evaluate how well the LLM’s responses match ideal answers.
  • Fine-tune the LLM to better align with your organization’s standards.
  • Continuously improve the LLM over time by comparing its outputs against a gold-standard reference.

Ground truth provides a stable foundation upon which to measure success. Without it, you have no reliable way to gauge if the model is improving or even performing adequately against your specific criteria.

Why is ground truth needed?

Without ground truth, every answer from your LLM is floating in a vacuum. How do you know if it’s correct? With ground truth, you can compute metrics — accuracy, precision, recall, or other domain-specific key performance indicators (KPIs) — to quantify performance.
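
To make that concrete, here is a minimal sketch of one way to score a model against a ground truth file. The CSV layout (question and answer columns) matches what the toolkit later in this post produces; the fuzzy string matching and the 0.8 threshold are illustrative assumptions, not a prescribed metric:

import pandas as pd
from difflib import SequenceMatcher

def evaluate_against_ground_truth(answer_fn, file_name: str = 'qa_output.csv',
                                  threshold: float = 0.8) -> float:
    """Return the fraction of ground truth questions the model answers closely enough."""
    df = pd.read_csv(file_name)
    hits = 0
    for _, row in df.iterrows():
        model_answer = answer_fn(row['question'])  # your LLM call goes here
        # Crude string similarity; swap in any domain-appropriate metric
        score = SequenceMatcher(None, model_answer.lower(),
                                row['answer'].lower()).ratio()
        if score >= threshold:
            hits += 1
    return hits / len(df)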

Ground truth datasets don’t just measure current performance; they guide ongoing optimization. As your business evolves, documents change, and regulations update, your ground truth can be refreshed to ensure your LLM stays aligned with the current state of knowledge.

In regulated industries, you must often prove that your system bases its advice on credible, up-to-date information. Ground truth offers an audit trail — you can point to the exact set of Q&A pairs that establish what the LLM should be saying.

What are some traditional ways to create ground truth?

Manual annotation and expert curation

Traditionally, subject matter experts (SMEs) read through documents, highlight key information, and manually create Q&A pairs. This approach is time-consuming, costly, and doesn’t scale easily. For organizations dealing with thousands of pages of content, manual methods quickly become impractical.

Crowdsourcing and surveys

Another method is to use crowdworkers or conduct surveys with participants who create or validate Q&A pairs. While this can scale better than a small team of experts, it can still be expensive, and quality control becomes challenging.

Why is creating ground truth difficult?

Domain expertise is scarce and expensive

You need specialized knowledge to ensure the Q&A pairs are correct and high-quality. Domain experts are in high demand and can be costly, especially if the ground truth needs frequent updates.

Volume and complexity of documents

As organizations grow, so does their documentation: internal wikis, knowledge bases, policy manuals, research reports, and more. Manually creating ground truth from vast, complex corpora is a monumental task.

Maintenance

Ground truth must evolve as your data evolves. Constant changes mean periodically updating and re-validating large portions of your dataset — a maintenance nightmare if done manually.
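
The automated approach described in the next section makes this refresh loop tractable. As a rough illustration, a hypothetical helper (assuming pairs are stored in the CSV format used later in this post) might drop stale pairs, merge in regenerated ones, and deduplicate:

import pandas as pd

def refresh_ground_truth(file_name, new_pairs_df, stale_questions):
    """Hypothetical sketch: drop pairs tied to outdated documents, merge in new ones, dedupe."""
    df = pd.read_csv(file_name)
    df = df[~df['question'].isin(stale_questions)]    # remove outdated pairs
    df = pd.concat([df, new_pairs_df], ignore_index=True)
    df = df.drop_duplicates(subset='question')        # keep one answer per question
    df.to_csv(file_name, index=False)
    return df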

A toolkit to generate ground truth from original documents

We’ve built a toolkit to assist in generating ground truth from your documents. Instead of manually crafting Q&A pairs, it enables you to prompt an advanced model to read through your text and produce them automatically. Then, you can spot-check the pairs for accuracy, saving a lot of time and effort.

Extracting information purely from text can sometimes introduce inaccuracies, especially when documents use various formats—such as tables, bullet points, or other structural elements—to represent data. These nuances can be lost or misread when content is converted to plain text.

To address this, the toolkit employs an image-based workflow. Instead of relying on text alone, we transform your documents into images and feed those into the LLM. This approach preserves the original layout and structure, ensuring that all visual and spatial cues remain intact and meaningful to the model. You can then guide the LLM with a carefully tailored prompt, validate the resulting Q&A pairs, and refine them as needed. By preserving the document’s visual context, our toolkit delivers more accurate and reliable ground truth datasets—faster than traditional methods.

To begin, you’ll need an LLM. You can use any LLM provider; this example uses OpenAI’s GPT-4, which requires an API key (another hosted model, such as Google’s Gemini Flash, would work the same way with a different client). The sample code below illustrates how to generate question-and-answer pairs from provided text and then save those pairs to a file for further analysis.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field
import pandas as pd
import os

class GroundTruth(BaseModel):
    question: str = Field(..., title="Question")
    answer: str = Field(..., title="Answer")

class GroundTruthResponse(BaseModel):
    qa_pairs: list[GroundTruth] = Field(..., title="List of question and answer pairs")

def generate_ground_truth(docs: list,
                          save_to_file: bool = True,
                          file_name: str = 'qa_output.csv'):
    """Generate Q&A pairs from each document and optionally append them to a CSV."""
    prompt = PromptTemplate(
        input_variables=["doc"],
        template="""
        Analyze the given text, generate accurate question and answer pairs from the given text only.
        Scope of question and answer should be solely based on the given text.
        Generate at least 1, up to 3 question-answer pairs for each text. Do not create one-word answers. All Q&A should be in English.

        Here is the text:
        {doc}
        """
    )
    # Expects OPENAI_API_KEY to be set in the environment; temperature=0 keeps
    # the generated pairs deterministic and tightly bound to the source text.
    llm = ChatOpenAI(model='gpt-4', temperature=0)
    # Constrain the model's output to the GroundTruthResponse schema above
    llm = llm.with_structured_output(GroundTruthResponse)
    chain = prompt | llm
    qa_list = []

    for doc in docs:
        output: GroundTruthResponse = chain.invoke({"doc": doc})
        qa_list.extend(output.qa_pairs)

    # Optionally append to a CSV file, writing the header only when the file is new
    if save_to_file:
        data = [{'question': g.question, 'answer': g.answer} for g in qa_list]
        df = pd.DataFrame(data)
        file_exists = os.path.isfile(file_name)
        df.to_csv(file_name, mode='a', index=False, header=not file_exists)

    return qa_list
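
A quick usage sketch (the sample text is a placeholder for your own document content):

# Placeholder document text; substitute your own source material
docs = [
    "Astra DB is a serverless database built on Apache Cassandra, exposing "
    "a Data API for storing and querying JSON documents and vectors.",
]
qa_pairs = generate_ground_truth(docs, save_to_file=True, file_name='qa_output.csv')
for pair in qa_pairs:
    print(pair.question, '->', pair.answer)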

Why use an image-based approach for PDFs and other document formats?

Organizations often hold crucial information locked away in PDF files and other document formats that aren’t simple text. These documents often contain intricate layouts, embedded tables, charts, and images. Traditional text extraction methods flatten this complexity into plain text, losing critical contextual cues such as headings, visual hierarchies, and spatial relationships.

Headings and layout

A PDF isn’t just a series of words—it’s a carefully designed page. Headings signal importance, sidebars offer supplementary insights, and layout choices guide the reader’s understanding. When you rely solely on text extraction, you strip away these visual markers. Multi-column layouts can become confusing streams of text, and captions or footnotes may end up detached from the elements they describe. By starting with the PDF as an image, you keep the layout intact and provide the model with a more faithful representation of the document’s structure.

Tables and charts

Tables and charts convey relationships among data points that aren’t easily captured when converted to text alone. Traditional extraction methods often scramble tables, merging cells or losing crucial formatting, making it hard to recreate the intended meaning. An image-based approach preserves these data visualizations exactly as they appear. This allows an LLM, in conjunction with image-processing techniques, to interpret rows, columns, and headers in context, producing Q&A pairs that accurately reflect the original table’s logic and insights.

Images and diagrams

PDFs frequently contain images, diagrams, and icons that provide essential clues. A technical manual, for example, might show a labeled diagram of a machine part. Without images, these references become meaningless lines of text. Converting PDFs to images ensures the model can “see” these visual elements. When generating Q&A pairs, it can consider an illustration’s location and its associated caption, leading to questions and answers that encompass the full meaning of the content. For example, a generated question might reference an annotated diagram directly—something pure text extraction simply can’t achieve.

Ultimately, converting PDFs into images before processing them leads to a ground truth dataset that is truer to the original source material. No matter how diverse your documents’ formats are, their content remains accessible in a rich, visually consistent manner. Every element—table cells, images, captions, and layout nuances—is considered. This fidelity helps the LLM generate Q&A pairs that capture the full depth and context of the documents. Such a dataset empowers your LLM-driven applications to provide more reliable, nuanced answers grounded firmly in the actual content and structure of your documents.
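
Here is a minimal sketch of that image-based workflow. It assumes the pdf2image library (which requires the poppler system package) and a vision-capable model such as gpt-4o, and it reuses the GroundTruthResponse schema from the earlier example:

# Sketch: render PDF pages to images and ask a vision-capable model for Q&A pairs.
# Assumes pdf2image (poppler required) and reuses GroundTruthResponse from above.
import base64
from io import BytesIO

from pdf2image import convert_from_path
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

def generate_ground_truth_from_pdf(pdf_path: str) -> list:
    # Vision-capable model, constrained to the same structured output schema
    llm = ChatOpenAI(model='gpt-4o', temperature=0).with_structured_output(GroundTruthResponse)
    qa_list = []
    for page in convert_from_path(pdf_path):   # one PIL image per PDF page
        buffer = BytesIO()
        page.save(buffer, format='PNG')
        b64 = base64.b64encode(buffer.getvalue()).decode()
        message = HumanMessage(content=[
            {"type": "text",
             "text": "Generate 1-3 question-answer pairs based solely on this "
                     "page, including any tables, charts, or diagrams."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ])
        output: GroundTruthResponse = llm.invoke([message])
        qa_list.extend(output.qa_pairs)
    return qa_list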

Conclusion

As LLMs transition from experimental tools to production-grade systems, their effectiveness hinges on reliable grounding and accurate ground truth data. By anchoring your LLM in authoritative sources—your own organizational documents, faithfully represented in their original layout and format—you ensure that answers are both domain-relevant and contextually precise. This approach reduces hallucinations, fosters trust, and supports compliance in settings where accuracy is non-negotiable.

Traditionally, building and maintaining a comprehensive ground truth dataset has been a labor-intensive task. Yet with the help of LLMs themselves, you can automate much of the work, transforming a daunting, manual process into a streamlined, iterative workflow. By combining automated Q&A generation, careful validation, and the flexibility of storage solutions like CSV files or Astra DB, you can swiftly create and refine a gold-standard dataset that reflects your evolving documentation.
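
For example, persisting the generated pairs to Astra DB is a short step with the astrapy Data API client. This is a sketch only: the token, endpoint, and collection name are placeholders, and docs is the document list from the earlier usage example.

from astrapy import DataAPIClient

# Placeholders: supply your own Astra DB application token and API endpoint
client = DataAPIClient("ASTRA_DB_APPLICATION_TOKEN")
db = client.get_database("ASTRA_DB_API_ENDPOINT")
collection = db.create_collection("ground_truth")

qa_pairs = generate_ground_truth(docs, save_to_file=False)
collection.insert_many(
    [{"question": g.question, "answer": g.answer} for g in qa_pairs]
)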

Key takeaways

  • Grounding ensures relevance: LLM responses remain tightly bound to verified, up-to-date organizational knowledge.
  • Ground truth provides a benchmark: It measures model performance, guiding iterative improvements.
  • Automation simplifies complexity: Generating Q&A pairs directly from your documents saves time and preserves nuanced context, including layout and visuals.
  • Future-proofing your LLM stack: Regular updates to your ground truth dataset help keep your AI agents current, consistent, and trustworthy.

Check out this GitHub repository for scripts and examples that you can adapt and extend for your own use cases. Armed with these insights and the provided code examples, you can confidently embark on creating and maintaining a robust ground truth dataset. Embrace this approach, tailor it to your own enterprise needs, and unlock the potential of well-grounded, high-quality language models in production.
