feat: Add PDF Loader to Document Loaders in Rig #24

Closed
1 task done
Tachikoma000 opened this issue Sep 18, 2024 · 1 comment

@Tachikoma000
Contributor

  • I have looked for existing issues (including closed) about this

Feature Request

Add a PDF Loader to the Document Loaders in Rig

Motivation

PDF is a widely used format for storing and sharing documents. Adding support for loading PDF files would significantly enhance Rig's capability to process and analyze a broader range of document types. This feature would allow users to easily incorporate PDF documents into their RAG (Retrieval-Augmented Generation) systems and other LLM tasks.

Use cases include:

  1. Extracting information from technical documentation stored in PDFs
  2. Analyzing academic papers and research reports
  3. Processing business documents and reports
  4. Incorporating legal documents into NLP workflows

Proposal

Implement a PdfLoader as part of the document_loaders module. The implementation should:

  1. Create a new file src/document_loaders/pdf.rs
  2. Implement a PdfLoader struct that implements the DocumentLoader trait
  3. Use the lopdf crate for parsing PDF files
  4. Extract text content from PDF documents
  5. Convert extracted content into DocumentEmbeddings

Basic structure:

use async_trait::async_trait;
use lopdf::Document;
use crate::embeddings::DocumentEmbeddings;
use super::DocumentLoader;

pub struct PdfLoader {
    path: String,
}

impl PdfLoader {
    pub fn new(path: &str) -> Self {
        Self { path: path.to_string() }
    }
}

#[async_trait]
impl DocumentLoader for PdfLoader {
    async fn load(&self) -> Result<Vec<DocumentEmbeddings>, Box<dyn std::error::Error + Send + Sync>> {
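        // Placeholder: parse the file at `self.path` with lopdf, extract the text,
        // and convert it into `DocumentEmbeddings`.
        todo!("PDF text extraction is not implemented yet")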
    }
}
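
The signature above returns a boxed error. To let callers tell "file not found" apart from a parsing failure, the loader could use a dedicated error type internally. The following is a minimal std-only sketch: `PdfLoaderError`, its variants, and `read_pdf_bytes` are hypothetical names that are not part of Rig, and the `%PDF-` header check is only a stand-in for real parsing with lopdf.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical error type for a PDF loader; not part of Rig's API.
#[derive(Debug)]
pub enum PdfLoaderError {
    /// The file could not be opened or read (for example, it does not exist).
    Io(io::Error),
    /// The bytes were read but do not look like a PDF.
    Parse(String),
}

impl fmt::Display for PdfLoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfLoaderError::Io(e) => write!(f, "failed to read PDF file: {}", e),
            PdfLoaderError::Parse(msg) => write!(f, "failed to parse PDF: {}", msg),
        }
    }
}

impl std::error::Error for PdfLoaderError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            PdfLoaderError::Io(e) => Some(e),
            PdfLoaderError::Parse(_) => None,
        }
    }
}

impl From<io::Error> for PdfLoaderError {
    fn from(e: io::Error) -> Self {
        PdfLoaderError::Io(e)
    }
}

/// Reads a file and checks that it starts with the `%PDF-` header.
/// The header check is a stand-in for real parsing, not a validity test.
pub fn read_pdf_bytes(path: impl AsRef<Path>) -> Result<Vec<u8>, PdfLoaderError> {
    let bytes = fs::read(path)?; // an io::Error converts via the `From` impl
    if !bytes.starts_with(b"%PDF-") {
        return Err(PdfLoaderError::Parse("missing %PDF- header".to_string()));
    }
    Ok(bytes)
}
```

Because `PdfLoaderError` implements `std::error::Error` and is `Send + Sync`, the `?` operator can still convert it into the `Box<dyn std::error::Error + Send + Sync>` used by `load`.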

Additional considerations:

  • Handle potential errors gracefully (e.g., file not found, parsing errors)
  • Implement chunking strategies for large PDFs
  • Ensure proper encoding handling for various PDF text encodings
  • Add unit tests for the PdfLoader
  • Update documentation to include usage examples
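
One possible chunking strategy for the large-PDF case is fixed-size windows with overlap, so that text cut at a chunk boundary still appears whole in one of the neighbouring chunks. The sketch below is an assumption about how this could look, not a decision made in this issue: `chunk_text` is a hypothetical helper, and it counts Unicode scalar values rather than bytes or model tokens.

```rust
/// Splits `text` into chunks of at most `max_chars` characters. Each chunk
/// after the first starts with the last `overlap` characters of the previous one.
/// Hypothetical helper; sizes are in Unicode scalar values, not bytes or tokens.
pub fn chunk_text(text: &str, max_chars: usize, overlap: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    assert!(overlap < max_chars, "overlap must be smaller than max_chars");

    // Collect chars up front so slicing never lands inside a multi-byte character.
    let chars: Vec<char> = text.chars().collect();
    let step = max_chars - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + max_chars).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the final chunk reached the end of the text
        }
        start += step;
    }
    chunks
}
```

Splitting on characters keeps the helper safe for non-ASCII text. A real loader might prefer to split on page, paragraph, or sentence boundaries instead.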

Alternatives

  1. Use a different PDF parsing library: We could use libraries like pdf-extract or pdf-rs instead of lopdf. However, lopdf seems to offer a good balance of features and performance.

  2. Implement PDF parsing from scratch: This would give us more control but would be time-consuming and potentially error-prone.

  3. Use external tools: We could use external command-line tools like pdftotext and call them from Rust. This would be simpler to implement but would add external dependencies and potential security risks.

  4. Defer PDF support to users: We could provide a trait for document loading and let users implement PDF support themselves. This would be simpler for Rig but would push complexity to the users.
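
For comparison, alternative 3 could look roughly like the sketch below. It assumes a `pdftotext` binary is installed and on `PATH`, and relies on the tool's convention that `-` as the output file writes the text to stdout. The function names are hypothetical and not part of Rig.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Builds the `pdftotext` invocation. Passing `-` as the output file
/// makes the tool write the extracted text to stdout.
pub fn pdftotext_command(pdf: &Path) -> Command {
    let mut cmd = Command::new("pdftotext");
    // Arguments are passed directly to the process, with no shell involved.
    cmd.arg(pdf).arg("-");
    cmd
}

/// Runs `pdftotext` and returns the extracted text.
/// Fails if the binary is missing or exits with a non-zero status.
pub fn extract_text_with_pdftotext(pdf: &Path) -> io::Result<String> {
    let output = pdftotext_command(pdf).output()?;
    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("pdftotext failed: {}", stderr.trim()),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

The code is short, but it only works on machines where the tool is installed, which is the external-dependency cost described in item 3.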

The proposed solution (using lopdf) was chosen because it provides a good balance of functionality, ease of implementation, and integration with Rust. It keeps the implementation within Rig, providing a cohesive experience for users without relying on external command-line tools.

@cvauclair
Contributor

Closed by #55

3 participants