Unstructured Integration

Data ingestion just got a lot easier with Unstructured and Astra DB.

Overview

Enterprise data comes in a wide variety of challenging formats: plain text, images, and document types ranging from HTML and PDF to CSV, PNG, and PPTX, to name a few. This complicates the crucial data-preparation step developers have to deal with when building retrieval-augmented generation (RAG) and generative AI applications.

That’s why Unstructured is such a powerful platform; it enables developers to convert any document, file type, or layout into LLM-ready data. Unstructured is a no-code cloud service that stands up GenAI data pipelines, from transformation and cleaning to generating embeddings for a vector database.

What is Unstructured with Astra DB?

Unstructured’s integration with DataStax Astra DB, the NoSQL and vector database for GenAI, enables developers to build RAG pipelines to quickly convert the most common document types into vector data for highly relevant GenAI similarity searches.

Building Your Application with Unstructured and Astra DB

Take a look at this Python tutorial to get started building a RAG pipeline powered by Astra DB Serverless. The code builds an LLM-based query engine and retrieves parsed data to provide contextual insights for users.

The Unstructured Python client library for document parsing is also included in RAGStack, which is aimed at enterprises that want a curated, supported, out-of-the-box GenAI stack for RAG applications built on LangChain and LlamaIndex. For details, see RAG with Unstructured and Astra DB. See below for more common use cases and how to get started.

Processing PDFs

Unstructured can process a set of PDF reports into vectorized data within Astra DB. Pointed at a directory of these files, Unstructured automatically parses and processes them, generates vector embeddings, and stores the results in Astra DB with minimal user effort. Vectorized data enables powerful downstream use cases, including RAG applications in which LLMs retrieve information and answer questions about the contents of proprietary PDF reports. Example code is provided in the Unstructured documentation:

from unstructured.partition.pdf import partition_pdf

# Returns a List[Element] present in the pages of the parsed pdf document
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf")

# Applies the English and Swedish language packs for OCR. OCR is only applied
# if the text is not available in the PDF.
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", languages=["eng", "swe"])
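
The "point at a directory" workflow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not Unstructured's own batch interface: `partition_directory` is a hypothetical helper, and the partitioner is passed in as an argument so that `partition_pdf` from the snippet above can be supplied in practice. Generating embeddings and writing to Astra DB are not shown.

```python
from pathlib import Path
from typing import Callable, Dict, List


def partition_directory(
    directory: str,
    partition_fn: Callable[[str], List],
    pattern: str = "*.pdf",
) -> Dict[str, List]:
    """Apply partition_fn to every file under directory that matches pattern.

    partition_directory is a hypothetical helper, not part of Unstructured.
    partition_fn is expected to take a file path and return a list of
    elements, as partition_pdf does in the snippet above.
    """
    results: Dict[str, List] = {}
    # rglob walks subdirectories too; sorting keeps the order deterministic.
    for path in sorted(Path(directory).rglob(pattern)):
        results[str(path)] = partition_fn(str(path))
    return results
```

With Unstructured installed, a call might look like `partition_directory("reports/", partition_pdf)`, where `reports/` is a placeholder path.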

Web URL Scraping

Unstructured with Astra DB can also support web scraping: it pulls data from HTML web pages, parses it into structured text, and generates embeddings for storage in Astra DB. For example, internal and external documentation pages can be parsed so that developers can build chatbots that let users query the documentation.

from unstructured.partition.html import partition_html

url = "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))
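
Before embeddings are generated, parsed text is usually grouped into chunks of a bounded size. The sketch below shows one simple way to do that over the string form of each element. It is a simplified stand-in, not Unstructured's chunking implementation; `chunk_texts` and the `max_chars` default are hypothetical.

```python
from typing import Iterable, List


def chunk_texts(texts: Iterable[str], max_chars: int = 500) -> List[str]:
    """Greedily merge consecutive text snippets into chunks of at most max_chars.

    A snippet that is longer than max_chars on its own is kept as a single
    oversized chunk rather than being split mid-sentence.
    """
    chunks: List[str] = []
    current = ""
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip empty snippets
        candidate = f"{current}\n\n{text}" if current else text
        if len(candidate) <= max_chars or not current:
            # Either the snippet still fits, or it starts a new chunk.
            current = candidate
        else:
            # Adding the snippet would overflow: close the chunk, start another.
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks
```

Applied to the output above, this could be called as `chunk_texts(str(el) for el in elements)`.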

Building an Email Database

Unstructured includes the ability to handle email messages and perform much of the same processing as shown above. While search is often an effective way to retrieve information, using an LLM to access information within emails can power a wide variety of use cases. For example, developers can ask the LLM to retrieve all emails that include receipts, or that discuss a particular topic.

from unstructured.partition.email import partition_email

elements = partition_email(filename="example-docs/fake-email.eml")

with open("example-docs/fake-email.eml", "r") as f:
    elements = partition_email(file=f)
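
Retrieving emails that discuss a particular topic typically rests on vector similarity between a query embedding and the stored embeddings. The following minimal sketch shows that ranking step with hand-made three-dimensional vectors in place of real embeddings. The vectors, labels, and the `top_k` helper are hypothetical; in a real pipeline the embeddings would come from a model and the search would run inside the vector database.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    items: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k (label, score) pairs most similar to the query vector."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for three emails (illustrative values only).
emails = [
    ("receipt from the bookstore", [0.9, 0.1, 0.0]),
    ("team offsite agenda", [0.0, 0.2, 0.9]),
    ("invoice for cloud services", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # stands in for an embedded "receipts" query
best = top_k(query_vector, emails, k=2)
```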

Category: Data Ingestion
Documentation: docs.datastax.com

FAQ

What is Unstructured?

Unstructured connects enterprise data to LLMs, no matter the source. The platform effortlessly extracts and transforms complex data for use with vector databases and LLM frameworks.

What is Astra DB?

The Astra DB vector database gives developers a familiar, intuitive Data API for vector and structured data types, and all the ecosystem integrations required to deliver production-ready generative AI applications on any infrastructure with unlimited scale.

How does Unstructured work?

Unstructured transforms data by extracting it from a source document, partitioning it for cleaning, performing chunking and metadata generation, and then rendering the results into a normalized JSON format. Vector embeddings are generated via one of several supported model hosts (including Hugging Face, AWS Bedrock, and OpenAI). When the pipeline completes, the data is written to Astra DB.
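
The stages described above can be sketched in plain Python to show the shape of the data as it moves through a pipeline. This is an illustrative sketch only: the function names and the record fields (`text`, `metadata`, `source`, `index`) are assumptions, not Unstructured's actual JSON schema, and the embedding and Astra DB write steps are omitted.

```python
import json
import re
from typing import Dict, List


def partition(raw: str) -> List[str]:
    """Split raw text into paragraph-level pieces on blank lines."""
    return [p for p in re.split(r"\n\s*\n", raw) if p.strip()]


def clean(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def to_records(raw: str, source: str) -> List[Dict]:
    """Partition, clean, and attach metadata, one record per piece."""
    records = []
    for index, piece in enumerate(partition(raw)):
        records.append({
            "text": clean(piece),
            "metadata": {"source": source, "index": index},
        })
    return records


raw_document = "Quarterly  report\n\nRevenue grew\nin the third quarter.\n"
records = to_records(raw_document, source="report.txt")

# Render the records as JSON, the form in which they would be handed on
# to an embedding step and then written to a database.
print(json.dumps(records, indent=2))
```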

When is it best to use the Unstructured integration?

The goal of Unstructured is to take any form of unstructured data (PDFs, emails, Word documents, web pages, etc.) and convert it into a structured format that can be neatly parsed and used to generate vector embeddings, so that downstream tools like LlamaIndex or LangChain can be used to build apps on top of that data. Unstructured suits a wide variety of cases, but it is particularly useful when building an AI pipeline that references unstructured data rather than data that is already neatly formatted as, for example, a CSV file or Excel sheet.

Do I need an Unstructured account to access this integration?

The open-source version of Unstructured does not require an account and can be installed with:

pip install "unstructured[all-docs]"

See more detailed installation instructions in the GitHub readme.