Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(loaders): loaders for files and pdfs #55

Merged
merged 15 commits into from
Oct 28, 2024
Merged

Conversation

0xMochan
Copy link
Contributor

@0xMochan 0xMochan commented Oct 10, 2024

This PR adds loader structs that help load files from the disk. It also adds an optional dependency, lopdf, for working w/ PDFs.

Implementation

  • FileLoader
  • PdfLoader

Both of these structs implement a typestate pattern that enforces state ransitions to happen in a specific order (defined by a state tree). This means, once a glob has been processed, you can only iterate on those files by reading them, or applying other specific methods like ignore_errors until you end with an iter to finally output the iterator.

iterator versus iter_generator

It was impossible to go the iter_generator route due to lifetimes. Since the ignore_errors method introduced a lifetime, adding a function that generates an iterator would have resulting in adding the 'static lifetime which seems like the wrong route.

Traits / Reusability

The traits defining the main *Loader methods we removed since it required the user to import those traits in order to use the API. This also resulted in some duplication amongst the FileLoader and PdfLoader API but one way around this is a 2 layer setup (similar to how Readable and Loadable is currently setup) as it allows for reusable code w/o relying on global traits. I'm curious on whether I should apply this to everything.

PDF specific

  • Lopdf keeps the entire PDF in memory so when we return data that open the PDFs and chunk it by pages, we return a Vec<(usize, String)> since the data needs to be owned (it's impossible to return a nested iterator afaik).
  • Lodpf currently has some bugs with some PDFs that I haven't figured out.
    • The API makes it a little tricky to read an entire page or read the entire PDF due to how insane PDFs as a file type are.

Readable reusability

I do reuse Readable (bad name lol) but the Error type is hardcoded making it really awkward to use in pdf.rs. I tried to add the type Error; to the trait but when Readable implements Result<PathBuf, FileLoaderError>, I can't use that type since it's inside the implementation. If I use generics, it's also not possible due to some fringe type error.

@0xMochan 0xMochan requested a review from cvauclair October 16, 2024 14:26
@0xMochan 0xMochan changed the title feat: loaders for files and pdfs feat(loaders): loaders for files and pdfs Oct 17, 2024
@0xMochan 0xMochan marked this pull request as ready for review October 17, 2024 15:26
@0xMochan
Copy link
Contributor Author

TODO: Clippy fails due to some eluded lifetimes

@cvauclair
Copy link
Contributor

@0xMochan can you merge main into your branch? There are some updates to the CI pipeline necessary to support feature flags

@cvauclair
Copy link
Contributor

@0xMochan seems like there are a lot of fmt/clippy errors still (unrelated to lifetimes), could you resolve those before I start my review? Thanks!

Copy link
Contributor

@cvauclair cvauclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things for both loaders:

  1. We should look into splitting the new constructor into with_glob and with_dir. Playing with it right now and I'm noticing that, for example:
let loader = FileLoader::new("my-dir/")?;

does not work, but using "my-dir/*" does work, which is not very clear.
2. Implement IntoIterator see my detailed comment on file.rs
3. Docstrings!!!
4. An example or two would go a long way towards evaluating the DevEx of the feature. For the FileLoader one, a trivial self-contained example could simply load the Cargo.toml from the workspace.

@0xMochan
Copy link
Contributor Author

Updates

  • Every docstring should be defined on the important loader methods.
  • examples/loaders.rs, and examples/agent_with_loaders.rs for showing how to use loaders
  • Test-case for both FileLoader and PDFFileLoader`
    • Should I include more of these?
  • Properly implements IntoIterator with into_iter

Questions / Concerns

  • lopdf parses the PDFs by outputting \ns where spaces exists.
  • The extracting of text from lopdf involves iterating through all of the object IDs and using extract_test by page number rather than using the object id system from lopdf.
  • Normal .iter and .iter_mut is not implemented due to the nature of how the iterator is owned. There's no way to call .next to return borrowed data when the data needs to be owned.
    • Or I'm unsure how to go about it, since iterator is boxed.

@0xMochan 0xMochan requested a review from cvauclair October 23, 2024 16:54
Copy link
Contributor

@cvauclair cvauclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Couple things to change in the docstrings, as well as some more open questions/comments.

@0xMochan 0xMochan requested a review from cvauclair October 28, 2024 17:11
@cvauclair cvauclair merged commit 208ba24 into main Oct 28, 2024
4 checks passed
This was referenced Oct 28, 2024
@0xMochan 0xMochan deleted the feat/file-loaders branch November 22, 2024 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants