feat: Add PDF Loader to Document Loaders in Rig #24

Tachikoma000 · 2024-09-18T22:14:39Z

I have looked for existing issues (including closed) about this

Feature Request

Add a PDF Loader to the Document Loaders in Rig

Motivation

PDF is a widely used format for storing and sharing documents. Adding support for loading PDF files would significantly enhance Rig's capability to process and analyze a broader range of document types. This feature would allow users to easily incorporate PDF documents into their RAG (Retrieval-Augmented Generation) systems and other LLM tasks.

Use cases include:

Extracting information from technical documentation stored in PDFs
Analyzing academic papers and research reports
Processing business documents and reports
Incorporating legal documents into NLP workflows

Proposal

Implement a PdfLoader as part of the document_loaders module. The implementation should:

Create a new file src/document_loaders/pdf.rs
Implement a PdfLoader struct that implements the DocumentLoader trait
Use the lopdf crate for parsing PDF files
Extract text content from PDF documents
Convert extracted content into DocumentEmbeddings

Basic structure:

use async_trait::async_trait;
use lopdf::Document;
use crate::embeddings::DocumentEmbeddings;
use super::DocumentLoader;

pub struct PdfLoader {
    path: String,
}

impl PdfLoader {
    pub fn new(path: &str) -> Self {
        Self { path: path.to_string() }
    }
}

#[async_trait]
impl DocumentLoader for PdfLoader {
    async fn load(&self) -> Result<Vec<DocumentEmbeddings>, Box<dyn std::error::Error + Send + Sync>> {
        // Implementation here: parse the PDF, extract its text, and build DocumentEmbeddings
        todo!()
    }
}
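
The load signature above returns a boxed error. If callers need to tell "file not found" apart from "not a valid PDF", a dedicated error type is one option. The following is only a sketch using the standard library: PdfLoaderError, its variants, and read_pdf_bytes are hypothetical names, not part of Rig or lopdf, and the %PDF- header check is a simplification standing in for real parsing.

```rust
use std::fmt;
use std::io;

/// Hypothetical error type for a PDF loader; the variant names are assumptions.
#[derive(Debug)]
pub enum PdfLoaderError {
    /// The file could not be opened or read (e.g. file not found).
    Io(io::Error),
    /// The bytes were read but do not look like a PDF.
    Parse(String),
}

impl fmt::Display for PdfLoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfLoaderError::Io(e) => write!(f, "failed to read PDF file: {e}"),
            PdfLoaderError::Parse(msg) => write!(f, "failed to parse PDF: {msg}"),
        }
    }
}

impl std::error::Error for PdfLoaderError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            PdfLoaderError::Io(e) => Some(e),
            PdfLoaderError::Parse(_) => None,
        }
    }
}

// Lets `?` convert I/O failures into PdfLoaderError automatically.
impl From<io::Error> for PdfLoaderError {
    fn from(e: io::Error) -> Self {
        PdfLoaderError::Io(e)
    }
}

/// Reads a file and checks that it starts with the `%PDF-` header.
/// (Hypothetical helper; a real loader would hand the bytes to a PDF parser.)
pub fn read_pdf_bytes(path: &str) -> Result<Vec<u8>, PdfLoaderError> {
    let bytes = std::fs::read(path)?;
    if !bytes.starts_with(b"%PDF-") {
        return Err(PdfLoaderError::Parse("missing %PDF- header".to_string()));
    }
    Ok(bytes)
}
```

Because PdfLoaderError implements std::error::Error and is Send + Sync, it can still be boxed into the error type used by the load signature above.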

Additional considerations:

Handle potential errors gracefully (e.g., file not found, parsing errors)
Implement chunking strategies for large PDFs
Ensure proper encoding handling for various PDF text encodings
Add unit tests for the PdfLoader
Update documentation to include usage examples
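
To make the chunking consideration above concrete, here is a minimal sketch of one possible strategy: fixed-size chunks measured in characters, with a configurable overlap between neighbours. The function name chunk_text and its parameters are assumptions for illustration; the issue does not specify which strategy Rig should adopt, and a token-based or page-based split may fit better in practice.

```rust
/// Splits `text` into chunks of at most `max_chars` characters, repeating
/// `overlap` characters between neighbouring chunks.
/// (Hypothetical helper; not part of Rig.)
pub fn chunk_text(text: &str, max_chars: usize, overlap: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    assert!(overlap < max_chars, "overlap must be smaller than max_chars");

    // Work on chars rather than bytes so multi-byte UTF-8 text is never split
    // in the middle of a code point.
    let chars: Vec<char> = text.chars().collect();
    let step = max_chars - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = usize::min(start + max_chars, chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the last chunk reached the end of the text
        }
        start += step;
    }
    chunks
}
```

Each chunk could then be embedded separately, so a large PDF does not have to fit into a single embedding request.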

Alternatives

Use a different PDF parsing library: We could use libraries like pdf-extract or pdf-rs instead of lopdf. However, lopdf seems to offer a good balance of features and performance.
Implement PDF parsing from scratch: This would give us more control but would be time-consuming and potentially error-prone.
Use external tools: We could use external command-line tools like pdftotext and call them from Rust. This would be simpler to implement but would add external dependencies and potential security risks.
Defer PDF support to users: We could provide a trait for document loading and let users implement PDF support themselves. This would be simpler for Rig but would push complexity to the users.
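
For comparison, the last alternative (deferring PDF support to users) could look roughly like the sketch below. TextLoader, InMemoryLoader, and total_chars are hypothetical names chosen for illustration, and the trait is synchronous to keep the example dependency-free; Rig's actual DocumentLoader trait may differ.

```rust
use std::error::Error;

/// Hypothetical minimal loader trait that a library could expose.
pub trait TextLoader {
    fn load_text(&self) -> Result<Vec<String>, Box<dyn Error + Send + Sync>>;
}

/// A user-side implementation. A real PDF loader would parse a file here;
/// this stand-in simply returns pages it already holds in memory.
pub struct InMemoryLoader {
    pub pages: Vec<String>,
}

impl TextLoader for InMemoryLoader {
    fn load_text(&self) -> Result<Vec<String>, Box<dyn Error + Send + Sync>> {
        Ok(self.pages.clone())
    }
}

/// Library-side code only depends on the trait, not on any PDF crate.
pub fn total_chars(loader: &dyn TextLoader) -> Result<usize, Box<dyn Error + Send + Sync>> {
    Ok(loader.load_text()?.iter().map(|p| p.chars().count()).sum())
}
```

This keeps the library small, but every user who needs PDFs has to write and maintain the parsing code themselves, which is the complexity cost noted above.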

The proposed solution (using lopdf) was chosen because it balances functionality, ease of implementation, and integration with Rust. It keeps the implementation within Rig, giving users a cohesive experience without requiring external tools.


cvauclair · 2024-10-28T17:38:11Z

Closed by #55

Tachikoma000 self-assigned this Sep 18, 2024

This was referenced Sep 18, 2024

feat: Add PDF loader to document loaders #25

Closed

feat(loaders): document loader pdf #26

Closed

mateobelanger added the feat label Oct 15, 2024

cvauclair closed this as completed Oct 28, 2024
