September 18, 2024

How to Chunk Text in JavaScript for Your RAG Application

Retrieval-augmented generation (RAG) applications begin with data, so getting your data in the right shape to work well with vector databases and large language models (LLMs) is the first challenge you’re likely to face when you get started building. In this post, we'll discuss the different ways to work with text data in JavaScript, exploring how to split it up into chunks and prepare it for use in a RAG app.

Why chunking is important

Often you will have swaths of unstructured text content that need to be broken down into smaller chunks of data. These chunks of text are turned into vector embeddings and stored in a vector database like Astra DB.

Compared to a whole document, smaller chunks of text capture fewer topics or ideas, which means that their embeddings will contain more focused meaning. This makes each chunk easier to match against incoming user queries and makes for more accurate retrieval. When you improve your retrieval, you can feed fewer—but more relevant—tokens to an LLM and create a more accurate and useful RAG system.

If you want to read more on the theory behind text chunking, check out this post that breaks down the nature of chunking, or this one from Unstructured on best practices for chunking.

Chunking in JavaScript

Let's move beyond theory and take a look at the options in JavaScript to chunk your data. The libraries we're going to look at are: llm-chunk, LangChain, LlamaIndex, and semantic-chunking. You can experiment with these chunking libraries using this app I put together.

We'll also take a look at using the Unstructured API for more complex use cases. Let's get started with the simplest of these modules.

llm-chunk

llm-chunk describes itself as a "super simple and easy-to-use text splitter", and it is! You can install it with npm:

npm install llm-chunk

It consists of one function, `chunk`, and it takes just a few options. You can choose maximum and minimum sizes for the chunks it produces, pick the size of the overlap between chunks, and choose a strategy for splitting the text, either by sentence or paragraph.

By default, it will split text up by paragraphs with a maximum length of 1,000 characters, and no minimum length or overlap.

import { chunk } from 'llm-chunk'
const text = loadText(); // get some text to split
const chunks = chunk(text);

You can choose to split the text by sentences instead, alter the maximum and minimum number of characters per chunk, or set how many characters each chunk should overlap by.

const chunks = chunk(text, { splitter: "sentence", minLength: 128, maxLength: 1024, overlap: 128 });

For more complex use cases, you can opt to pass a set of delimiters. These get turned into a regular expression and used as the basis for splitting the text, instead of splitting by paragraph or sentence. llm-chunk is simple, fast, and a great option when you are starting your RAG journey.
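As a rough sketch of how that could look (I'm assuming here that the delimiters option takes a string that llm-chunk compiles into a regular expression; check the llm-chunk README for the exact format it expects), splitting on Markdown headings might be:

import { chunk } from 'llm-chunk'
const text = loadText(); // get some text to split
// assumed usage: start a new chunk wherever a Markdown heading begins
const chunks = chunk(text, { maxLength: 1024, delimiters: "#" });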

LangChain

LangChain is much more than a text splitter. It is a library that helps you load data, split it up, embed it, store that data in and retrieve it from vector databases (including Astra DB), feed it as a prompt to LLMs, and more. There’s a lot to explore with LangChain, but we’re going to concentrate on text splitting.

You can install the LangChain text splitters with this command:

npm install @langchain/textsplitters

If you are using the main langchain module, then @langchain/textsplitters is included as a dependency.

LangChain includes three main splitter classes: the CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter. Let's take a look at what each of them does.

CharacterTextSplitter

This is the simplest of the splitters provided by LangChain. It splits up a document by a character, then merges segments back together until they reach the desired chunk size, overlapping the chunks by the desired number of characters.

The default character for splitting up text is "\n\n", which means it initially aims to split the text by paragraphs, though you can provide a different character to split by. Here's how you would use this LangChain splitter:

import { CharacterTextSplitter } from "@langchain/textsplitters";
const text = loadText(); // get some text to split
const splitter = new CharacterTextSplitter({
  separator: "\n\n", // the default; provide a different character to change how the text is split
  chunkSize: 1024,
  chunkOverlap: 128
});
const output = await splitter.splitText(text);

You can also output LangChain Documents, which is useful if you are using the rest of LangChain to create a data pipeline:

const output = await splitter.createDocuments([text]);

RecursiveCharacterTextSplitter

The CharacterTextSplitter is naive and doesn't take much of the structure of a piece of text into account. The RecursiveCharacterTextSplitter goes further, using a list of separators to progressively break text down until it creates chunks that fit the size you want. By default, it splits text first by paragraphs, then sentences, then words.

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const text = loadText(); // get some text to split
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 128
});
const output = await splitter.splitText(text);

You can provide different characters for the RecursiveCharacterTextSplitter to split the text on, so it can be used to split up other types of text. There are two classes that make it easy to split up Markdown or LaTeX: the MarkdownTextSplitter and LatexTextSplitter, respectively.
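Using the MarkdownTextSplitter looks just like the other splitters; under the hood it's a RecursiveCharacterTextSplitter preconfigured with Markdown-aware separators:

import { MarkdownTextSplitter } from "@langchain/textsplitters";
const markdown = loadText(); // get some Markdown to split
const splitter = new MarkdownTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 128
});
const output = await splitter.splitText(markdown);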

The RecursiveCharacterTextSplitter can also be used to split up code in a number of different languages. For example, if you wanted to split up a JavaScript file, you could do this:

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 300,
  chunkOverlap: 0,
});
const jsOutput = await splitter.splitText(jsCode);

The RecursiveCharacterTextSplitter is a versatile text splitter and is likely a good first stop when you're building up a pipeline to ingest data for your RAG application.

TokenTextSplitter

Taking a different approach, the TokenTextSplitter first turns the text into tokens using js-tiktoken, splits the tokens into chunks, and then converts the tokens back into text.

Tokens are the way that LLMs consume content; a token can be a whole word or just part of one. OpenAI has a good representation of how its models break text into tokens. You can use the TokenTextSplitter like this:

import { TokenTextSplitter } from "@langchain/textsplitters";
const text = loadText(); // get some text to split
const splitter = new TokenTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 128
});
const output = await splitter.splitText(text);

LlamaIndex

LlamaIndex is also responsible for much more than just text splitting. We're going to home in on its splitting capabilities, though, which you can use outside of the rest of LlamaIndex too.

LlamaIndex considers chunks of a document as Nodes, and the rest of the library works with Nodes. There are three available processors: the SentenceSplitter, MarkdownNodeParser, and SentenceWindowNodeParser.

You can install the entire LlamaIndex suite with:

npm install llamaindex

If you only want the text parsers, you can install just the LlamaIndex core:

npm install @llamaindex/core

SentenceSplitter

The SentenceSplitter is the simplest of the LlamaIndex splitters. It splits the text into sentences and then combines them into chunks that are smaller than the provided chunkSize.

LlamaIndex does this differently from the previous splitters; it measures the size of a chunk in tokens rather than characters. There are approximately four characters to a token, so if you want to use the DataStax suggested default of 1024 characters with an overlap of 128, you should aim for chunks of 256 tokens and an overlap of 32 tokens.

The SentenceSplitter returns an array of chunks like this:

import { SentenceSplitter } from "@llamaindex/core/node-parser";
const text = loadText(); // get some text to split
const splitter = new SentenceSplitter({
  chunkSize: 256,
  chunkOverlap: 32,
});
const output = splitter.splitText(text);

MarkdownNodeParser

If you have Markdown to split up, the MarkdownNodeParser will break it into logical sections based on the headers in the document.

This splitter doesn't let you set a chunkSize or overlap, so you do give up that level of control. It also works on LlamaIndex Documents or Nodes rather than raw text. In this example, we turn our text into a Document first, then get the Nodes from it:

import { MarkdownNodeParser } from "@llamaindex/core/node-parser";
import { Document } from "@llamaindex/core/schema";
const text = loadText(); // get some text to split
const splitter = new MarkdownNodeParser();
const nodes = splitter.getNodesFromDocuments([new Document({ text })]);
const output = nodes.map(node => node.text);

SentenceWindowNodeParser

The final LlamaIndex parser breaks text down into sentences and then produces a Node for each sentence, with a window of sentences on either side. You can choose how big the window is: a window size of one produces Nodes with three sentences (the current sentence, plus one before and one after), while a window size of two produces Nodes with five sentences (the current sentence, plus two on either side).

This parser works on Documents as well; you use it like so:

import { SentenceWindowNodeParser } from "@llamaindex/core/node-parser";
import { Document } from "@llamaindex/core/schema";
const text = loadText(); // get some text to split
const splitter = new SentenceWindowNodeParser({ windowSize: 3 });
const nodes = splitter.getNodesFromDocuments([new Document({ text })]);
const output = nodes.map(node => node.text);

semantic-chunking

semantic-chunking is not as well known as the other text splitters in the JavaScript world, but I wanted to include it as something that's a bit different. It still gives you control over the maximum size of your chunks, but it uses a more interesting method to split them up.

It first splits the text into sentences, then generates embedding vectors for each sentence using a locally downloaded model (by default it uses Xenova/all-MiniLM-L6-v2, but you can choose a different one if you want). It then groups the sentences into chunks based on how similar they are, using cosine similarity. The intention is to group sentences that belong together and have related content, and then start a new chunk when the topic changes.

This is a smarter type of chunking than just splitting by chunk size, and likely even smarter than the markdown parsers that at least take section headings into account. The trade-off is that it is likely to be slower as there is more computation to be done.

It is still simple to use, though; install it with:

npm install semantic-chunking

You can pass a similarityThreshold to the chunkit function, which is the minimum cosine similarity required for two sentences to be included in the same chunk. A high threshold sets a high bar for a sentence to be included and likely results in smaller chunks. A low threshold will allow for fewer, bigger chunks. As always, it's worth experimenting with this setting to find what works for your data.

import { chunkit } from "semantic-chunking";
const text = loadText(); // get some text to split
const chunks = await chunkit(text, {
  maxTokenSize: 256,
  similarityThreshold: 0.5
});
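You can also swap in a different embedding model by name. Here's a sketch based on the semantic-chunking README (the onnxEmbeddingModel option name is worth double-checking there):

import { chunkit } from "semantic-chunking";
const text = loadText(); // get some text to split
const chunks = await chunkit(text, {
  maxTokenSize: 256,
  similarityThreshold: 0.5,
  // option name taken from the semantic-chunking README; the model is
  // downloaded and run locally to embed each sentence
  onnxEmbeddingModel: "Xenova/paraphrase-multilingual-MiniLM-L12-v2"
});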

There are other options around further combining chunks; check out the documentation for more detail.

Unstructured

Unstructured is a platform for extracting data from files and chunking it in a smart way. There is an open source toolkit for this, a no-code platform, and an API. We're going to investigate the API here.

The API has support for extracting data from loads of different file types, including images and documents like PDFs that may contain images or tables of data. Below is a simple example of calling the Unstructured API; you can read about more of the capabilities, particularly around extracting data from PDFs and images, in the Unstructured API documentation.

First you should install the API client:

npm install unstructured-client

You will need an API key. You can get a free API key, or try the paid service with its two-week free trial. Once you have an API key, you can use the client to call the API.

In this example, I am chunking text from a Markdown file, but check out the API parameters you can use to control the behaviour with other types of file.

import { UnstructuredClient } from "unstructured-client";
import { ChunkingStrategy } from "unstructured-client/sdk/models/shared/index.js";
import { readFileSync } from "node:fs";

const client = new UnstructuredClient({
  serverURL: "https://api.unstructuredapp.io",
  security: {
    apiKeyAuth: "YOUR_API_KEY_HERE",
  },
});

const data = readFileSync("./post.md");

const res = await client.general.partition({
  partitionParameters: {
    files: {
      content: data,
      fileName: "post.md",
    },
    chunkingStrategy: ChunkingStrategy.BySimilarity,
  },
});
if (res.statusCode === 200) {
  console.log(res.elements);
}

You'll notice I selected a chunking strategy of similarity; you can read about the other chunking strategies available in the documentation, as well as partitioning strategies, which are used for documents like images and PDFs.
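For instance, the by_title strategy chunks a document along its section headings. A sketch of that might look like the following (the maxCharacters and combineUnderNChars parameters here are taken from the Unstructured API docs, so verify them there):

const res = await client.general.partition({
  partitionParameters: {
    files: {
      content: data,
      fileName: "post.md",
    },
    // chunk at section headings instead of by similarity
    chunkingStrategy: ChunkingStrategy.ByTitle,
    // cap chunks at 1024 characters and merge sections smaller
    // than 256 characters into their neighbours
    maxCharacters: 1024,
    combineUnderNChars: 256,
  },
});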

What's your chunking strategy?

It turns out that there are many options for splitting text up into chunks in the JavaScript ecosystem: from the quick and simple, like llm-chunk, to fully featured libraries with full GenAI pipelines, like LangChain and LlamaIndex, to more complex methods like semantic-chunking and the Unstructured API, with all the options it brings for loading and splitting documents.

Getting your chunking strategy right is important for getting good results from your RAG application, so make sure you read up on best practices and try out the available options to see the sort of results you get from them. If you want to experiment with llm-chunk, LangChain, LlamaIndex, and semantic-chunking, check out the example application, Chunkers.

Once you've got your chunking sorted, the next step is turning your chunks into vector embeddings, so check out how to do that with ease using Astra DB Vectorize.

Let me know what you're using to chunk your unstructured data by giving me a shout on Twitter at @philnash or drop me an email at phil.nash@datastax.com.
