This project helps Rust developers build text- and language-based applications that work with documents or other text. It splits large documents into smaller chunks without heavy resource usage.
Use chunkr to split large PDF documents into smaller chunks for LLM training and RAG (Retrieval-Augmented Generation) application development.
To add chunkr to your project and start chunking, use the cargo CLI:
cargo add chunkr
Several examples are available in the examples directory. Check those out to get started.
Clone the repository and run one of the examples from the examples directory:
git clone https://github.com/d1pankarmedhi/chunkr.git
cd chunkr
The examples cover the following chunking strategies:
- Chunking by words - Chunk your documents or text by number of words.
- Chunking by characters - Chunk your documents or text by number of characters.
- Chunking PDF documents - Chunk your PDF documents by words or characters.
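To make the word and character strategies concrete, here is a minimal sketch in plain Rust using only the standard library. It is not chunkr's API: the function names `overlapping_windows`, `chunk_by_words`, and `chunk_by_chars` are hypothetical, and the sketch assumes that the overlap is the number of words (or characters) shared between consecutive chunks, which may differ from how chunkr interprets it.

```rust
/// Slide a window of `chunk_size` items over `items`, advancing by
/// `chunk_size - overlap` each step so neighbouring windows share `overlap` items.
/// Hypothetical helper for illustration; not part of chunkr.
fn overlapping_windows<T>(items: &[T], chunk_size: usize, overlap: usize) -> Vec<&[T]> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let step = chunk_size - overlap;
    let mut windows = Vec::new();
    let mut start = 0;
    while start < items.len() {
        let end = (start + chunk_size).min(items.len());
        windows.push(&items[start..end]);
        if end == items.len() {
            break; // the last window reached the end of the input
        }
        start += step;
    }
    windows
}

/// Chunk by number of words (words are separated by whitespace).
fn chunk_by_words(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    overlapping_windows(&words, chunk_size, overlap)
        .into_iter()
        .map(|window| window.join(" "))
        .collect()
}

/// Chunk by number of characters (Unicode scalar values, not bytes).
fn chunk_by_chars(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    overlapping_windows(&chars, chunk_size, overlap)
        .into_iter()
        .map(|window| window.iter().collect::<String>())
        .collect()
}

fn main() {
    for chunk in chunk_by_words("one two three four five six seven", 3, 1) {
        println!("{chunk}");
    }
    for chunk in chunk_by_chars("abcdefgh", 4, 2) {
        println!("{chunk}");
    }
}
```

Collecting characters into a `Vec<char>` keeps chunk boundaries on character boundaries, so multi-byte UTF-8 text is never split mid-character. For very large inputs, a streaming approach would use less memory.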
Run the examples with cargo, passing the chunk size, overlap, and file path as arguments:
# cargo run --example example-name chunk-size overlap file-path
cargo run --example chunk_document 1000 20 /home/home/Downloads/clean_code.pdf
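As a rough illustration of how an example program could read the three positional arguments shown above (chunk size, overlap, file path), here is a standard-library sketch. The `ExampleArgs` struct and `parse_args` function are hypothetical names introduced for this sketch. The bundled examples may parse their arguments differently.

```rust
use std::env;

/// Positional arguments in the order used by the command above.
/// Hypothetical type for illustration; not part of chunkr.
#[derive(Debug, PartialEq)]
struct ExampleArgs {
    chunk_size: usize,
    overlap: usize,
    file_path: String,
}

/// Parse `chunk-size overlap file-path` from an iterator of arguments
/// (the program name must already be skipped).
fn parse_args<I: Iterator<Item = String>>(mut args: I) -> Result<ExampleArgs, String> {
    let chunk_size = args
        .next()
        .ok_or("missing chunk-size")?
        .parse::<usize>()
        .map_err(|e| format!("invalid chunk-size: {e}"))?;
    let overlap = args
        .next()
        .ok_or("missing overlap")?
        .parse::<usize>()
        .map_err(|e| format!("invalid overlap: {e}"))?;
    let file_path = args.next().ok_or("missing file-path")?;
    Ok(ExampleArgs { chunk_size, overlap, file_path })
}

fn main() {
    // Skip argv[0], which is the program name.
    match parse_args(env::args().skip(1)) {
        Ok(args) => println!("{args:?}"),
        Err(msg) => eprintln!("usage: <chunk-size> <overlap> <file-path> ({msg})"),
    }
}
```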
This is an open-source project, and we welcome all kinds of contributions: code, documentation, issues, bug reports, and feature suggestions.
Feel free to check out the Contribution guide for more details.
This project is licensed under the MIT License. See the LICENSE.md file for details.