A user points out that RAG tools offer little visibility into the chunks their chunking strategies actually produce. They present a tool that lets users inspect and correct chunks before exporting them for ingestion into a vector store.
NVIDIA recently published [an interesting study on chunking strategies](https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/), showing that the choice of chunking method can significantly affect the performance of retrieval-augmented generation (RAG) systems, depending on the domain and the structure of the source documents.

However, most RAG tools provide little visibility into what the resulting chunks actually look like. Users typically choose a chunk size and overlap and move on without inspecting the outcome.

An earlier step is often overlooked: converting source documents to Markdown. If a PDF is converted incorrectly (producing collapsed tables, merged columns, or broken headings), no chunking strategy can fix those structural errors. The text representation should be validated before splitting.

**Chunky** is an open-source local tool designed to address this gap. Its workflow lets users review the Markdown conversion alongside the original PDF, select a chunking strategy, visually inspect each generated chunk, and directly correct problematic splits before exporting clean JSON ready for ingestion into a vector store.

The goal is not to review every document but to solve the template problem. In domains like medicine, law, and finance, documents often follow standardized layouts. By sampling representative files, it’s possible to identify an effective chunking strategy for each layout and apply it reliably across the rest of the dataset.

Chunky integrates LangChain’s text splitters and Chonkie.

GitHub link: 🐿️ [Chunky](https://github.com/GiovanniPasq/chunky)
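
To make the "inspect before you ingest" idea concrete, here is a minimal Python sketch of the loop described above: split with a chunk size and overlap, flag chunks that look suspicious, and export the result as JSON. It is not Chunky's implementation and does not call LangChain or Chonkie. The function names `split_with_overlap` and `inspect_chunks`, the character-based splitting, and the table heuristic are all assumptions made for illustration.

```python
import json


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character chunking with overlap (illustrative only)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def inspect_chunks(chunks: list[str]) -> list[dict]:
    """Attach simple warnings so problematic splits stand out during review.

    Hypothetical heuristic: a chunk boundary that falls on a Markdown
    table row (a line starting with '|') probably cut a table in two.
    """
    report = []
    last = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        lines = chunk.splitlines()
        warnings = []
        if lines and index > 0 and lines[0].lstrip().startswith("|"):
            warnings.append("starts inside a Markdown table")
        if lines and index < last and lines[-1].lstrip().startswith("|"):
            warnings.append("ends inside a Markdown table")
        report.append(
            {"index": index, "length": len(chunk), "warnings": warnings, "text": chunk}
        )
    return report


if __name__ == "__main__":
    # Toy Markdown document; a real workflow would load the converted PDF.
    markdown = "# Report\n\nIntro paragraph.\n\n| a | b |\n| 1 | 2 |\n\nClosing note."
    report = inspect_chunks(split_with_overlap(markdown, chunk_size=40, overlap=5))
    for entry in report:
        print(entry["index"], entry["length"], entry["warnings"])
    # Export in a JSON shape that an ingestion script could read.
    print(json.dumps(report, indent=2))
```

Even a crude check like this surfaces the kind of split that is easy to miss when only the chunk size and overlap are visible. A visual tool goes further by letting you fix the boundary by hand instead of only reporting it.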