What is Unstructured?
Unstructured connects enterprise data to LLMs, no matter the source. The platform effortlessly extracts and transforms complex data for use with vector databases and LLM frameworks.
Data ingestion just got a lot easier with Unstructured and Astra DB.
Enterprise data comes in a wide variety of challenging formats: plain text, images, and document types ranging from HTML and PDF to CSV, PNG, and PPTX, to name a few. This complicates the crucial data-preparation step developers have to deal with when building retrieval-augmented generation (RAG) and generative AI applications.
That’s why Unstructured is such a powerful platform; it enables developers to convert any document, file type, or layout into LLM-ready data. Unstructured is a no-code cloud service that stands up GenAI data pipelines, from transformation and cleaning to generating embeddings for a vector database.
Unstructured’s integration with DataStax Astra DB, the NoSQL and vector database for GenAI, enables developers to build RAG pipelines to quickly convert the most common document types into vector data for highly relevant GenAI similarity searches.
Take a look at this Python tutorial to get started building a RAG pipeline powered by Astra DB Serverless. The code builds an LLM-based query engine and retrieves parsed data to provide contextual insights for users.
The Unstructured Python client library for document parsing is also included in RAG. It’s for enterprises that want a curated, supported, out-of-the-box GenAI stack for enterprise RAG applications, leveraging LangChain and LlamaIndex. For details, see RAG with Unstructured and Astra DB. See below for more common use cases and how to get started.
Unstructured can be used to process a set of reports into vectorized data within Astra DB. Once pointed at a directory of these files, Unstructured automatically parses and processes them, generates vector embeddings, and stores the results in Astra DB with minimal user effort. Vectorized data enables powerful downstream use cases, including RAG applications built on top of the data, so LLMs can access and answer questions about information stored in proprietary PDF reports. Example code is provided in the Unstructured documentation:
from unstructured.partition.pdf import partition_pdf

# Returns a List[Element] present in the pages of the parsed pdf document
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf")

# Applies the English and Swedish language pack for ocr. OCR is only applied
# if the text is not available in the PDF.
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", languages=["eng", "swe"])
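To illustrate the "point at a directory" idea, here is a minimal sketch that runs a partition function over every PDF in a folder. The helper name `partition_directory`, its parameters, and the stub partitioner in the demo are assumptions made for illustration; they are not part of the Unstructured library. When no partition function is passed, the sketch falls back to `partition_pdf`, called with a file path as in the example above.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional


def partition_directory(
    directory: str,
    partition_fn: Optional[Callable[[str], List]] = None,
    pattern: str = "*.pdf",
) -> Dict[str, List]:
    """Run a partition function over every matching file in a directory.

    Hypothetical helper: returns a mapping of file name -> parsed elements.
    """
    if partition_fn is None:
        # Imported lazily so the helper can be tried without the library installed.
        from unstructured.partition.pdf import partition_pdf

        partition_fn = partition_pdf

    results: Dict[str, List] = {}
    for path in sorted(Path(directory).glob(pattern)):
        results[path.name] = partition_fn(str(path))
    return results


# Demo with a stub partitioner (an assumption) so the sketch runs without real PDFs.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("q1.pdf", "q2.pdf", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    results = partition_directory(
        tmp, partition_fn=lambda p: [f"parsed {Path(p).name}"]
    )

print(sorted(results))
```

Passing the partition function as a parameter keeps the directory walk separate from the parsing step, so the same loop could be reused with other partitioners for other file types.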
Unstructured with Astra DB can also power a web-scraping workflow: pulling data from HTML web pages, parsing it into structured text, and generating embeddings for storage in Astra DB. For example, internal and external documentation pages can be parsed so that developers can build chatbots that let users query information from the documentation.
from unstructured.partition.html import partition_html

url = "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))
Unstructured includes the ability to handle email messages and perform much of the same processing as shown above. While search is often an effective way to retrieve information, using an LLM to access information within emails can power a wide variety of use cases. For example, developers can ask the LLM to retrieve all emails that include receipts, or that discuss a particular topic.
from unstructured.partition.email import partition_email

elements = partition_email(filename="example-docs/fake-email.eml")

with open("example-docs/fake-email.eml", "r") as f:
    elements = partition_email(file=f)
Astra DB users can simplify the processing of PDFs using the new Astra Data Loader. The Data Loader supports multiple files and large file sizes, so users can now ingest PDFs directly through the Astra DB portal. It handles everything else, leveraging Unstructured.io's capabilities to partition and chunk documents. If you’re also using Vectorize, embeddings are automatically generated with your preferred provider. No coding is necessary!
Self-managed Langflow now offers flexible document ingestion with an Unstructured component. Upload a variety of file types, including PDFs, images, videos, Word documents, and PowerPoint presentations. The integration supports both the Unstructured serverless API and local Unstructured installations for simple document ingestion within your Langflow flows.
The Astra DB vector database gives developers a familiar, intuitive Data API for vector and structured data types, and all the ecosystem integrations required to deliver production-ready generative AI applications on any infrastructure with unlimited scale.
Unstructured transforms data by extracting it from a source document, partitioning it for cleaning, performing chunking and metadata generation, and then rendering the results into a normalized JSON format. Vector embeddings are generated via a number of supported model hosts (including Hugging Face, AWS Bedrock, and OpenAI). When the pipeline completes, the data is written to Astra DB.
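The stages described above (partition, clean, chunk, attach metadata, render to JSON) can be sketched with the standard library alone. This is a simplified stand-in, not Unstructured's implementation: the `Element` class, the title heuristic, and the function names are all assumptions chosen for illustration, and the embedding and Astra DB write steps are left out.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """Toy stand-in for a partitioned document element (hypothetical type)."""
    type: str
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def partition_text_blocks(raw: str, source: str) -> List[Element]:
    """Split on blank lines; short blocks without a final period count as titles."""
    elements = []
    for block in raw.split("\n\n"):
        text = " ".join(block.split())  # cleaning: collapse runs of whitespace
        if not text:
            continue
        is_title = len(text) < 40 and not text.endswith(".")
        kind = "Title" if is_title else "NarrativeText"
        elements.append(Element(kind, text, {"source": source}))
    return elements


def chunk_by_section(elements: List[Element], max_chars: int = 200) -> List[Element]:
    """Group elements into chunks, starting a new one at each title or size limit."""
    chunks: List[Element] = []
    current: List[str] = []
    source = elements[0].metadata["source"] if elements else None

    def flush() -> None:
        if current:
            metadata = {"source": source, "chunk_index": len(chunks)}
            chunks.append(Element("Chunk", " ".join(current), metadata))
            current.clear()

    for el in elements:
        too_long = len(" ".join(current + [el.text])) > max_chars
        if el.type == "Title" or too_long:
            flush()
        current.append(el.text)
    flush()
    return chunks


def to_json(elements: List[Element]) -> str:
    """Render elements as a normalized JSON array of records."""
    return json.dumps([asdict(e) for e in elements], indent=2)


raw_document = """Quarterly Report

Revenue   grew in the third quarter.

Outlook

Demand is expected to remain stable."""

chunks = chunk_by_section(partition_text_blocks(raw_document, source="report.txt"))
normalized_json = to_json(chunks)
print(normalized_json)
```

In a real pipeline, each JSON record would then be sent to an embedding model and stored, together with its vector, in Astra DB.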
The goal of Unstructured is to take any form of unstructured data (PDFs, emails, Word docs, web pages, etc.) and convert it into a structured format that can be neatly parsed and used to generate vector embeddings, so that downstream tools like LlamaIndex or LangChain can be used to build apps on top of that data. Unstructured can be used in a wide variety of cases, but it is particularly useful when building an AI pipeline that references unstructured data rather than something that's already nicely formatted as, for example, a CSV or Excel sheet.
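To make the downstream step concrete, the sketch below shows how text chunks, once embedded, can be ranked against a question by cosine similarity. It is a toy illustration under loud assumptions: the `embed` function uses plain word counts in place of a real embedding model, the sample chunks are invented, and the ranking runs in memory rather than in a vector database such as Astra DB.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding (assumption): word counts instead of a real model's vector."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(token, 0.0) for token, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: List[str], k: int = 1) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented sample chunks standing in for parsed document text.
chunks = [
    "Invoice 1042: payment received for the annual subscription.",
    "The quarterly report shows revenue growth in Europe.",
    "Meeting notes: discuss the hiring plan for next year.",
]

best = top_k("Which report mentions revenue growth?", chunks, k=1)
print(best[0][1])
```

A RAG application would pass the top-ranked chunks to an LLM as context for answering the question.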
Unstructured’s open-source library for Python is free, but there is also a platform and API service that are fee-based.
The open-source version of Unstructured can be installed with:
pip install "unstructured[all-docs]"
See more detailed installation instructions in the GitHub readme.