Marc Mayol
Chonkie and Docling: two complementary approaches to document processing in RAG pipelines

In the world of intelligent text organization and retrieval (for example, in RAG architectures — retrieval-augmented generation), one of the main challenges is preparing documents so that they are “digestible” by the retrieval and generation components. This is where chunking comes into play — that is, splitting text into coherent pieces — but also correctly interpreting the document itself (structure, tables, layout, etc.).

Two recent tools stand out for their distinct yet potentially complementary approaches:

  • Chonkie: specialized in lightweight, modular, and efficient text fragmentation for AI pipelines.
  • Docling: focused on the ingestion and rich representation of multi-format documents, with awareness of structure, layout, tables, OCR, and more.

Below we explore how they work, their key differences, and how to combine them to build more powerful RAG pipelines.


Chonkie: the lightweight hippo that elegantly slices your texts

“CHONK your texts with Chonkie” is the playful motto of this library, which aims to be “no-nonsense” and ultra-lightweight. 👉 https://github.com/chonkie-inc/chonkie

Why does Chonkie exist?

In RAG systems or semantic search engines, one common issue is text fragmentation. Splitting large documents into coherent pieces without losing context is a delicate balance.

Many libraries offer solutions, but they often include unnecessary dependencies or are too heavy. Chonkie focuses only on the essentials: efficient chunking, refinement, and modular export.

According to its benchmarks, the base installation takes up around 15 MB, compared with 80–170 MB for alternative libraries, and its token-based chunking can be up to 33× faster.


Key features

  • Multiple chunkers:
      • TokenChunker: splits by tokens
      • SentenceChunker: splits by sentences
      • RecursiveChunker: hierarchical division
      • SemanticChunker: splits by semantic similarity (embeddings)
      • Others: LateChunker, CodeChunker, SlumberChunker, etc.

  • Modular pipeline and refinement: chains stages together: chunking → refinement (overlap, embeddings) → export.

  • Integrations: Compatible with vector databases like Chroma, Qdrant, Pinecone, pgvector, among others.

  • Multilingual support: Supports over 50 languages, making it easy to work with global content.

  • Chonkie Cloud: In addition to the local version, it offers a cloud service to offload chunking.
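
To make the chunk-and-overlap idea concrete, here is a naive, dependency-free sketch of token-based chunking with overlap. This is not Chonkie's implementation (Chonkie counts real model tokens and is heavily optimized); the whitespace "tokenizer" and the chunk_size/overlap values are purely illustrative:

```python
def naive_token_chunks(text, chunk_size=8, overlap=2):
    """Split text into word-level chunks with a fixed overlap.

    A toy stand-in for what a token chunker does; real libraries
    count model tokens, not whitespace-separated words.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

text = "one two three four five six seven eight nine ten eleven twelve"
for chunk in naive_token_chunks(text, chunk_size=6, overlap=2):
    print(chunk)
```

The overlap is the key detail: each chunk repeats the tail of the previous one so a retriever never loses the context that sits on a chunk boundary.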


Basic example

from chonkie import TokenChunker

# Split the text into fixed-size token chunks
chunker = TokenChunker()
chunks = chunker("This is a sample text I want to split into useful pieces.")
for c in chunks:
    print(c.text, c.token_count)

A more complex pipeline example:

from chonkie import Pipeline

pipe = (
    Pipeline()
    .chunk_with("recursive", tokenizer="gpt2", chunk_size=2048, recipe="markdown")
    .chunk_with("semantic", chunk_size=512)
    .refine_with("overlap", context_size=128)
    .refine_with("embeddings", embedding_model="sentence-transformers/all-MiniLM-L6-v2")
)

doc = pipe.run(texts="Your long text here...")
for ch in doc.chunks:
    print(ch.text)

This modular system allows customization of each step according to project needs.


Current limitations

  • Dependency on external models for embeddings and semantic chunking.
  • Still a small community.
  • Maintenance risk: the repository was temporarily taken offline by its author for legal reasons but was later restored.
  • Some initial learning curve when configuring complex pipelines.

Docling: deep structural understanding of documents

“From document chaos to structured knowledge” 👉 https://github.com/docling-project/docling

What is Docling?

Developed by IBM’s Deep Search team, Docling is an open-source tool focused on converting complex documents (PDF, DOCX, PPTX, HTML, images, etc.) into a structured representation ready for AI.

Its goal is not just text extraction, but understanding the document’s visual and semantic structure: tables, headers, columns, hierarchy, images, and more.


Main capabilities

  • Advanced PDF parsing: detects columns, headers, and reading order.
  • Table extraction using models such as TableFormer.
  • Integrated OCR for scanned documents.
  • Rich document representation: creates DoclingDocument objects with sections, tables, figures, and metadata.
  • Integration with LangChain via DoclingLoader.
  • Support for Markdown, JSON, or custom chunks.

Example with LangChain:

from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

loader = DoclingLoader(file_path="document.pdf", export_type=ExportType.MARKDOWN)
docs = loader.load()

Use cases

  • Conversion of reports, papers, and presentations into structured text.
  • Preprocessing of documents for QA and RAG.
  • Data extraction from semi-structured documents.
  • Large-scale ingestion of enterprise data (PDFs, scans, internal documents).

⚖ Comparison: Chonkie vs Docling

| Aspect | Chonkie | Docling |
|---|---|---|
| Purpose | Fragment text for RAG pipelines | Convert complex documents into structured text |
| Input level | Plain text (already preprocessed) | Raw formats (PDF, DOCX, images, etc.) |
| Structural awareness | Does not analyze layout or design | Understands tables, columns, hierarchy, layout |
| Chunking | Multiple strategies (tokens, sentences, semantic) | Has a basic module but focuses on structure |
| Performance | Very fast and lightweight | Heavier due to vision/OCR models |
| AI / RAG integration | Native with vector DBs and embeddings | Compatible with LangChain and other loaders |
| Maintenance | Young project with emerging community | Backed by IBM, with documentation and research papers |
| Ideal use | Process clean text and fragment it optimally | Ingest and structure complex documents |

How to combine them for maximum results

In reality, Chonkie and Docling are not competitors but natural allies.

An ideal hybrid strategy could be:

  1. Use Docling to convert PDFs, DOCX, or scans into a structured representation (DoclingDocument).
  2. Extract the text from relevant sections or paragraphs.
  3. Apply Chonkie on those fragments to optimize chunking (by tokens, semantic, or recursive).
  4. Index the resulting chunks in a vector database for search or augmented retrieval.

This way you get the best of both worlds: deep document structural understanding and efficient fragmentation optimized for language models.


Conclusion

  • Docling is the translator between document chaos and semantic structure.
  • Chonkie is the tuner that turns that clean text into optimal AI-ready fragments.

Combining them allows you to build more accurate, faster, and robust RAG pipelines, especially when working with complex, multi-format documents. If you want to dive deeper into Docling, I have this article where I explain it in detail, or you can also check out the Chonkie article.

Tip: if you’re building your own RAG system, try using Docling for ingestion and Chonkie for chunking. Your model will thank you.