I have looked for existing issues (including closed) about this
Feature Request
Add a PDF Loader to the Document Loaders in Rig
Motivation
PDF is a widely used format for storing and sharing documents. Adding support for loading PDF files would significantly enhance Rig's capability to process and analyze a broader range of document types. This feature would allow users to easily incorporate PDF documents into their RAG (Retrieval-Augmented Generation) systems and other LLM tasks.
Use cases include:
Extracting information from technical documentation stored in PDFs
Analyzing academic papers and research reports
Processing business documents and reports
Incorporating legal documents into NLP workflows
Proposal
Implement a PdfLoader as part of the document_loaders module. The implementation should:
Create a new file src/document_loaders/pdf.rs
Implement a PdfLoader struct that implements the DocumentLoader trait
Use the lopdf crate for parsing PDF files
Handle potential errors gracefully (e.g., file not found, parsing errors)
Implement chunking strategies for large PDFs
Ensure proper encoding handling for various PDF text encodings
Add unit tests for the PdfLoader
Update documentation to include usage examples
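To make the list above concrete, here is a rough sketch of how the loader, its error handling, and a simple chunking strategy could fit together. Everything in it is an assumption for illustration: the `DocumentLoader` trait shown is a stand-in (Rig's real trait may have a different signature), `PdfLoaderError` and `chunk_text` are hypothetical names, and text extraction is passed in as a closure so the example compiles with the standard library alone. A real implementation would call a PDF parser such as lopdf inside that closure.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical error type for the loader (not Rig's actual error type).
#[derive(Debug)]
pub enum PdfLoaderError {
    /// The file could not be read, e.g. it does not exist.
    Io { path: PathBuf, source: io::Error },
    /// The file was read but could not be parsed as a PDF.
    Parse(String),
}

impl fmt::Display for PdfLoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfLoaderError::Io { path, source } => {
                write!(f, "cannot read {}: {}", path.display(), source)
            }
            PdfLoaderError::Parse(msg) => write!(f, "cannot parse PDF: {}", msg),
        }
    }
}

impl std::error::Error for PdfLoaderError {}

/// Stand-in for the `DocumentLoader` trait named in the proposal.
/// Assumed shape: load a document and return its text as chunks.
pub trait DocumentLoader {
    fn load(&self) -> Result<Vec<String>, PdfLoaderError>;
}

/// Loader that reads a file, checks the PDF header, extracts text via
/// the supplied `extract` function, and splits the text into chunks.
pub struct PdfLoader<E> {
    path: PathBuf,
    chunk_size: usize,
    extract: E,
}

impl<E> PdfLoader<E>
where
    E: Fn(&[u8]) -> Result<String, String>,
{
    pub fn new(path: impl Into<PathBuf>, chunk_size: usize, extract: E) -> Self {
        PdfLoader { path: path.into(), chunk_size, extract }
    }
}

impl<E> DocumentLoader for PdfLoader<E>
where
    E: Fn(&[u8]) -> Result<String, String>,
{
    fn load(&self) -> Result<Vec<String>, PdfLoaderError> {
        // Report I/O failures (such as "file not found") with the path attached.
        let bytes = fs::read(&self.path).map_err(|source| PdfLoaderError::Io {
            path: self.path.clone(),
            source,
        })?;
        // Simplistic sanity check: PDF files start with the "%PDF-" marker.
        if !bytes.starts_with(b"%PDF-") {
            return Err(PdfLoaderError::Parse("missing %PDF- header".to_string()));
        }
        // Delegate the actual parsing to the injected extractor.
        let text = (self.extract)(&bytes).map_err(PdfLoaderError::Parse)?;
        Ok(chunk_text(&text, self.chunk_size))
    }
}

/// Splits `text` into chunks of at most `max_chars` characters.
/// Works on `char`s, so multi-byte UTF-8 characters are never cut in half.
pub fn chunk_text(text: &str, max_chars: usize) -> Vec<String> {
    let max_chars = max_chars.max(1); // guard against a zero chunk size
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|chunk| chunk.iter().collect())
        .collect()
}
```

Fixed-size character chunks are only the simplest possible strategy. Splitting on page or paragraph boundaries, or adding overlap between chunks, would probably suit RAG use better and could replace `chunk_text` without changing the rest of the sketch.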
Alternatives
Use a different PDF parsing library: We could use libraries like pdf-extract or pdf-rs instead of lopdf. However, lopdf seems to offer a good balance of features and performance.
Implement PDF parsing from scratch: This would give us more control but would be time-consuming and potentially error-prone.
Use external tools: We could use external command-line tools like pdftotext and call them from Rust. This would be simpler to implement but would add external dependencies and potential security risks.
Defer PDF support to users: We could provide a trait for document loading and let users implement PDF support themselves. This would be simpler for Rig but would push complexity to the users.
The proposed solution (using lopdf) was chosen because it provides a good balance of functionality, ease of implementation, and integration with Rust. It keeps the implementation within Rig, providing a cohesive experience for users without external dependencies.
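For comparison, the external-tool alternative could look roughly like the sketch below. The function name `extract_with_external_tool` is hypothetical, and the code assumes a tool that accepts `<input.pdf> -` to write text to standard output, as pdftotext does. It also shows where the extra dependency comes in: the call fails at runtime if the binary is not installed.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Runs `program <pdf> -` and returns whatever the tool printed to stdout.
/// `program` would typically be "pdftotext" (an assumption of this sketch).
pub fn extract_with_external_tool(program: &str, pdf: &Path) -> io::Result<String> {
    let output = Command::new(program)
        .arg(pdf) // passed as a single argument, never through a shell
        .arg("-") // "-" asks pdftotext to write the text to stdout
        .output()?; // fails here if the binary is missing

    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("{} exited with {}: {}", program, output.status, stderr.trim()),
        ));
    }
    // Decode lossily so unexpected bytes do not abort the whole extraction.
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Passing the path as a separate argument avoids shell injection, but the approach still relies on users installing and updating the tool themselves, which is the trade-off described above.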