feat(loaders): loaders for files and pdfs #55

0xMochan · 2024-10-10T14:54:49Z

This PR adds loader structs that help load files from the disk. It also adds an optional dependency, lopdf, for working w/ PDFs.

Implementation

FileLoader
PdfLoader

Both of these structs implement a typestate pattern that enforces state transitions in a specific order (defined by a state tree). This means that once a glob has been processed, you can only operate on those files by reading them or by applying other specific methods like ignore_errors, until you end with iter to finally output the iterator.
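For readers unfamiliar with the typestate pattern, here is a minimal sketch of the idea, not the code in this PR. The names `Loader`, `Paths`, `Contents`, `from_paths`, and `fake_read` are hypothetical, and a stand-in function replaces real file I/O so the snippet stays self-contained. Only `read` and `ignore_errors` echo method names mentioned above.

```rust
use std::marker::PhantomData;

// Hypothetical marker types for the two states; not the names used in the PR.
struct Paths;
struct Contents;

// A loader whose type parameter records which state it is currently in.
struct Loader<State> {
    items: Vec<Result<String, String>>,
    _state: PhantomData<State>,
}

impl Loader<Paths> {
    // Start in the `Paths` state with a list of (pretend) file paths.
    fn from_paths(paths: &[&str]) -> Self {
        Loader {
            items: paths.iter().map(|p| Ok(p.to_string())).collect(),
            _state: PhantomData,
        }
    }

    // `read` exists only in the `Paths` state and moves the loader to `Contents`.
    fn read(self, reader: impl Fn(&str) -> Result<String, String>) -> Loader<Contents> {
        Loader {
            items: self
                .items
                .into_iter()
                .map(|p| p.and_then(|p| reader(p.as_str())))
                .collect(),
            _state: PhantomData,
        }
    }
}

impl Loader<Contents> {
    // `ignore_errors` exists only after `read`; it drops the failed reads.
    fn ignore_errors(self) -> Vec<String> {
        self.items.into_iter().filter_map(Result::ok).collect()
    }
}

// Stand-in for file I/O: only ".txt" paths are considered readable.
fn fake_read(path: &str) -> Result<String, String> {
    if path.ends_with(".txt") {
        Ok(format!("contents of {path}"))
    } else {
        Err(format!("cannot read {path}"))
    }
}
```

With this shape, `Loader::from_paths(&["a.txt", "b.bin"]).read(fake_read).ignore_errors()` compiles, while calling `ignore_errors` before `read` is a compile error because the method is not defined for `Loader<Paths>`.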

`iterator` versus `iter_generator`

It was impossible to go the iter_generator route due to lifetimes. Since the ignore_errors method introduces a lifetime, adding a function that generates an iterator would have required the 'static lifetime, which seems like the wrong route.

Traits / Reusability

The traits defining the main *Loader methods were removed since they required the user to import those traits in order to use the API. This resulted in some duplication between the FileLoader and PdfLoader APIs, but one way around this is a two-layer setup (similar to how Readable and Loadable are currently set up), as it allows for reusable code w/o relying on global traits. I'm curious whether I should apply this to everything.

PDF specific

lopdf keeps the entire PDF in memory, so when we open the PDFs and chunk them by page, we return a Vec<(usize, String)> since the data needs to be owned (it's impossible to return a nested iterator afaik).
lopdf currently has some bugs with some PDFs that I haven't figured out.
- The API makes it a little tricky to read an entire page or read the entire PDF due to how insane PDFs as a file type are.

`Readable` reusability

I do reuse Readable (bad name lol), but the Error type is hardcoded, making it really awkward to use in pdf.rs. I tried to add type Error; to the trait, but when Readable is implemented for Result<PathBuf, FileLoaderError>, I can't use that type since it's inside the implementation. Using generics isn't possible either, due to some fringe type error.

0xMochan · 2024-10-17T15:27:03Z

TODO: Clippy fails due to some elided lifetimes

cvauclair · 2024-10-18T13:14:59Z

@0xMochan can you merge main into your branch? There are some updates to the CI pipeline necessary to support feature flags

cvauclair · 2024-10-18T14:47:13Z

@0xMochan seems like there are a lot of fmt/clippy errors still (unrelated to lifetimes), could you resolve those before I start my review? Thanks!

cvauclair

Couple of things for both loaders:

1. We should look into splitting the new constructor into with_glob and with_dir. Playing with it right now and I'm noticing that, for example:

let loader = FileLoader::new("my-dir/")?;

does not work, but using "my-dir/*" does work, which is not very clear.
2. Implement IntoIterator (see my detailed comment on file.rs)
3. Docstrings!!!
4. An example or two would go a long way towards evaluating the DevEx of the feature. For the FileLoader one, a trivial self-contained example could simply load the Cargo.toml from the workspace.
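To make the with_dir suggestion in point 1 concrete, here is a rough standard-library-only sketch of what such a constructor could do under the hood. The helper name `files_in_dir` is hypothetical and is not part of the PR. The glob-based path is omitted here since it would depend on an external crate.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical helper sketching what a `with_dir` constructor could do:
// list the regular files directly inside `dir`, with no glob syntax needed.
fn files_in_dir(dir: impl AsRef<Path>) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Skip subdirectories; this sketch does not recurse.
        if path.is_file() {
            files.push(path);
        }
    }
    // `read_dir` order is platform-dependent, so sort for stable output.
    files.sort();
    Ok(files)
}
```

With a helper like this, passing "my-dir/" would behave the way a user expects, and the "my-dir/*" form would stay reserved for with_glob.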

rig-core/src/loaders/file.rs

rig-core/src/loaders/pdf.rs

0xMochan · 2024-10-23T16:54:28Z

Updates

Docstrings are now defined on all of the important loader methods.
Added examples/loaders.rs and examples/agent_with_loaders.rs to show how to use loaders
Test cases for both FileLoader and PDFFileLoader
- Should I include more of these?
Properly implements IntoIterator with into_iter

Questions / Concerns

lopdf parses the PDFs by outputting \ns where spaces exist.
Extracting text with lopdf involves iterating through all of the object IDs and calling extract_text by page number rather than using lopdf's object ID system.
Normal .iter and .iter_mut are not implemented due to the nature of how the iterator is owned. There's no way to call .next to return borrowed data when the data needs to be owned.
- Or I'm unsure how to go about it, since iterator is boxed.
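As a rough illustration of the owned, boxed-iterator situation described above (a sketch under assumptions, not the PR's code): a struct that holds a `Box<dyn Iterator>` can implement IntoIterator by handing the box over, which consumes the struct. The names `Loaded` and `load_lengths` are hypothetical.

```rust
// Hypothetical stand-in for a loader that has already built its pipeline.
struct Loaded<'a, T> {
    iterator: Box<dyn Iterator<Item = T> + 'a>,
}

impl<'a, T> IntoIterator for Loaded<'a, T> {
    type Item = T;
    type IntoIter = Box<dyn Iterator<Item = T> + 'a>;

    // Consuming `self` lets us return the boxed iterator directly;
    // every item it yields is owned, so nothing borrows from the loader.
    fn into_iter(self) -> Self::IntoIter {
        self.iterator
    }
}

// Build a `Loaded` that yields owned (name, length) pairs.
fn load_lengths<'a>(names: &'a [&'a str]) -> Loaded<'a, (String, usize)> {
    Loaded {
        iterator: Box::new(names.iter().map(|n| (n.to_string(), n.len()))),
    }
}
```

A `for (name, len) in load_lengths(&names)` loop then works directly. A borrowing `.iter()` is harder to offer in this shape because advancing the boxed iterator requires mutable access and the items it yields are owned values rather than references into the struct.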

cvauclair

Looking good! Couple things to change in the docstrings, as well as some more open questions/comments.

rig-core/src/loaders/file.rs

rig-core/src/loaders/pdf.rs


0xMochan force-pushed the feat/file-loaders branch from b947112 to 999d724 Compare October 10, 2024 19:23

This was referenced Oct 14, 2024

feat(loaders): CSV Loader to Document Loaders #30

Closed

feat(loaders): document loader pdf #26

Closed

0xMochan requested a review from cvauclair October 16, 2024 14:26

0xMochan changed the title ~~feat: loaders for files and pdfs~~ feat(loaders): loaders for files and pdfs Oct 17, 2024

0xMochan marked this pull request as ready for review October 17, 2024 15:26

0xMochan added 5 commits October 18, 2024 09:28

feat: loaders for files and pdfs

b182ada

fix: latest implementations of loaders

c8e0d7a

fix: make pdf optional

6153e24

fix: clippy pt1

bc7e9fd

fix: adding cfg test

74f32e6

0xMochan force-pushed the feat/file-loaders branch from 657c68b to 74f32e6 Compare October 18, 2024 14:28

0xMochan added 3 commits October 18, 2024 11:00

fix: clippy

82fdee4

fix: remedy tests

f96abd0

test: file loader test

5b00932

cvauclair requested changes Oct 21, 2024


fix: several pdf errors

3429b4a

mateobelanger assigned 0xMochan Oct 23, 2024

0xMochan added 3 commits October 23, 2024 10:59

test(loaders): finish test for PDfFileLoader

83defee

docs(examples): agent_with_loader example

8f719df

fix(loaders): remove iter and iter_mut since they don't work here

cc58ed3

0xMochan requested a review from cvauclair October 23, 2024 16:54

fix(loaders): merge conflict

1ef9e57

cvauclair requested changes Oct 24, 2024


0xMochan added 2 commits October 24, 2024 16:01

docs(loaders): improve docstrings, visibility on internal traits, ite…

04d1fb5

…rator usage

docs(loaders): fix error PdfLoaderError

3611a25

0xMochan requested a review from cvauclair October 28, 2024 17:11

cvauclair approved these changes Oct 28, 2024


cvauclair merged commit 208ba24 into main Oct 28, 2024
4 checks passed

cvauclair mentioned this pull request Oct 28, 2024

feat: Add PDF Loader to Document Loaders in Rig #24

Closed

1 task

This was referenced Oct 28, 2024

chore: release #80

Closed

chore: release #87

Merged

0xMochan deleted the feat/file-loaders branch November 22, 2024 19:32
