Effective Chunking: Getting the Most Out of Your Embedding Models
This article is part of a series on RAG (Retrieval-Augmented Generation) optimization. At Kmeleon, we specialize in AI and machine learning solutions designed to optimize and streamline data processes for businesses. One key aspect of maximizing the potential of embedding models is the effective use of chunking. This article explores the concept of chunking, its significance, and various techniques to implement it efficiently, all from the perspective of our experience and expertise at Kmeleon.
What is Chunking?
Chunking is a pre-processing step that involves breaking down texts into smaller, manageable pieces or "chunks." This crucial process defines how much text each vector in your vector database will capture and store. Imagine you're working with a set of books. Chunking could split the text into chapters, paragraphs, sentences, or even individual words. While this might sound simple, the chunking strategy you choose can significantly impact the performance of your vector databases and the outputs of language models.
Context Window and Tokens
In the realm of language models, a "context window" refers to the maximum amount of text (measured in "tokens," which are words, parts of words, or punctuation) that a model can process at one time. Understanding these terms is essential as they determine how text is chunked and processed by the model.
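For a quick intuition, here is a minimal sketch of token counting, assuming the tiktoken library; the encoding name and sample sentence are illustrative choices, not requirements of any particular model:
%pip install -qU tiktoken
import tiktoken

# Load a tokenizer; "cl100k_base" is the encoding used by several recent OpenAI models.
encoding = tiktoken.get_encoding("cl100k_base")

text = "Chunking splits long documents into smaller, manageable pieces."
tokens = encoding.encode(text)

print(len(text.split()), "words")   # 8 words
print(len(tokens), "tokens")        # typically a little more than the word count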
Why Chunking Matters
Let's take the example of building a vector database from a set of books.
Extreme Approach 1: Cataloging each book as one vector. This helps identify which books best match a query as a whole, but it is inefficient for finding specific information within a book.
Extreme Approach 2: Cataloging each sentence as a vector. This method excels at finding specific concepts but might struggle with broader information like the theme of a chapter or a book.
The choice of chunking strategy depends on your specific use case. The key takeaway is that chunking determines the unit of information stored and retrieved from your database, impacting search results and retrieval-augmented generation (RAG) workflows.
Meeting Model Requirements
Another reason to chunk data is to align with the input constraints of language models, which have finite context windows. A typical model might handle only a few thousand tokens at a time, so extensive texts like "The Lord of the Rings," at over 500,000 words, far exceed this limit and make chunking essential. Chunking is also vital for optimizing RAG, which enhances large language models (LLMs) by supplying them with up-to-date data from a database, reducing the risk of generating outdated or incorrect information.
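For a rough sense of scale: English prose averages on the order of 1.3 tokens per word, so a 500,000-word text corresponds to roughly 650,000 tokens, far more than such a context window can hold in a single pass.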
Balancing Chunk Size
Using chunks that are too small or too large each comes with its own set of challenges:
Small Chunks: Allow more chunks to be included in the LLM's context window but may lack sufficient context, leading to incomplete or unclear information.
Large Chunks: Provide more context but limit the number of chunks that can be included, potentially omitting relevant data and increasing costs, as the rough budget sketch below illustrates.
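To make this trade-off concrete, here is a back-of-the-envelope sketch; the context-window size, prompt overhead, and chunk lengths are illustrative assumptions rather than recommendations:
# Rough budget: how many retrieved chunks fit alongside the prompt?
context_window = 8_000    # tokens the model can process at once (assumed)
prompt_overhead = 1_000   # tokens reserved for instructions and the user question (assumed)

for tokens_per_chunk in (100, 300, 1_000):
    chunks_that_fit = (context_window - prompt_overhead) // tokens_per_chunk
    print(f"{tokens_per_chunk} tokens per chunk -> about {chunks_that_fit} chunks fit")
Smaller chunks let a single request draw on many more sources, while larger chunks crowd each other out of the window.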
Finding the right balance is crucial. Now let’s explore practical chunking techniques starting with fixed-size chunking.
Fixed-Size Chunking
Fixed-size chunking involves splitting texts into chunks of a predefined size, such as 100 words or 200 characters. This straightforward method is popular for its simplicity and effectiveness.
Implementations
In fixed-size chunking, the unit of measurement can be words, characters, or tokens. Smaller chunks risk being too granular, while larger chunks offer richer information but may become too general and less effective for precise searches.
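As a minimal sketch (illustrative only, not a production implementation), a word-based fixed-size splitter with overlap might look like this:
def fixed_size_chunks(text, chunk_size=100, overlap=20):
    """Split text into chunks of `chunk_size` words, with `overlap` words shared between neighbors."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = fixed_size_chunks("word " * 450)  # a toy 450-word text
print(len(chunks), "chunks of up to 100 words each")
The overlap keeps a little shared context between neighboring chunks, which helps when a relevant sentence straddles a chunk boundary.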
Variable-Size Chunking
Variable-size chunking uses markers like sentence or paragraph breaks to determine chunk boundaries, allowing the chunk size to be an outcome rather than a predefined parameter. This approach can be particularly useful when dealing with texts of varying structures.
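A minimal sketch of this idea, using blank lines as paragraph markers (both the marker and the sample text are illustrative):
text = "First paragraph about chunking.\n\nSecond paragraph about embeddings.\n\nThird paragraph about retrieval."

# Split on paragraph breaks; the chunk size is whatever each paragraph happens to be.
paragraph_chunks = [p.strip() for p in text.split("\n\n") if p.strip()]

for chunk in paragraph_chunks:
    print(len(chunk.split()), "words:", chunk)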
Mixed Strategy
A mixed strategy combines fixed-size and variable-size chunking. For example, you can split text at paragraph markers and then apply a fixed-size filter to merge or split chunks as needed to achieve the desired size.
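Here is one possible sketch of such a mixed strategy, assuming paragraph breaks as the primary marker and an illustrative target of about 120 words per chunk:
def mixed_chunks(text, target_words=120):
    """Split on paragraph breaks, then merge consecutive paragraphs until ~target_words is reached."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for paragraph in paragraphs:
        current.append(paragraph)
        if sum(len(p.split()) for p in current) >= target_words:
            chunks.append("\n\n".join(current))
            current = []
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Oversized paragraphs would still need a second pass, e.g. with a fixed-size splitter.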
Considerations for Choosing a Chunking Method
Text per Search Result: Match chunk size to the desired search result length, whether it’s a sentence, paragraph, or other.
Input Query Length: Align chunk size with the typical length of input queries to ensure effective matching.
Database Size: Larger chunks reduce the number of entries and overall database size.
Model Requirements: Ensure chunk sizes fit within the model’s context window for both embedding generation and RAG.
RAG Workflows: Balance the need for context (longer chunks) with the ability to include diverse sources (shorter chunks).
Rule of Thumb
Start with chunk sizes of 100-150 words and adjust based on performance and specific requirements. This flexible approach will help you fine-tune your chunking strategy for optimal results.
Comparative Analysis
To add depth to the discussion, let's compare different chunking strategies in a hypothetical scenario:
Scenario: Searching for specific medical information in a large database of research papers.
Fixed-Size Chunking: Provides consistent chunk sizes, which can be beneficial for uniform search results. However, it might miss contextual nuances.
Variable-Size Chunking: Offers more context-specific chunks, enhancing the relevance of search results. This method is particularly useful for documents with diverse structures.
Mixed Strategy: Combines the benefits of both fixed and variable chunking, providing balanced and contextually rich search results.
Implementation with Langchain
Here's how you can implement chunking using Langchain's RecursiveCharacterTextSplitter:
How it Works:
It uses a list of characters to decide where to split the text.
It splits the text step by step using these characters: ["\n\n", "\n", " ", ""].
This means it first tries to split by paragraphs (\n\n), then by single line breaks (\n), then by words (the space " "), and finally, if nothing else fits, by individual characters (the empty string "").
The goal is to keep related parts of the text together as much as possible (like paragraphs, sentences, then words).
Details:
Splitting: Based on the list of characters.
Chunk Size: Measured by the number of characters.
%pip install -qU langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
Output:
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and'
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'
To get plain strings instead of Document objects, you can call split_text directly:
text_splitter.split_text(state_of_the_union)[:2]
['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']
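Notice how the end of the first chunk ("Members of Congress and") reappears at the start of the second ("of Congress and the Cabinet"): that is the chunk_overlap=20 setting carrying a little shared context across adjacent chunks. The chunk_size of 100 characters used here is deliberately tiny for demonstration; in practice you would start closer to the 100-150 word rule of thumb above and tune from there.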
At Kmeleon, we recognize that chunking is a fundamental yet powerful technique that can dramatically enhance the performance of embedding models and vector databases. By carefully selecting and balancing chunk sizes, your organization can ensure that your language models and search systems operate at peak efficiency, delivering accurate and relevant results. Whether through fixed-size, variable-size, or mixed chunking strategies, the right approach can optimize information retrieval and improve the overall effectiveness of your natural language processing applications.
Are you ready to unlock the full potential of your AI and machine learning solutions? Let Kmeleon guide you with our expertise. For personalized consultations on Generative AI and more, visit Kmeleon and discover how we can help you stay ahead in the ever-evolving landscape of AI technology.