Page Streaming? #2

Open
xdave opened this issue Jan 21, 2025 · 0 comments

xdave commented Jan 21, 2025

Hi again, I sometimes have very large PDF documents (1200+ pages) to convert into markdown. In my current setup (using another parser), I'm using pypdfium2 to split out a single page at a time and then passing it to docling.

In this setup, I'm carefully managing buffers so that a potentially massive document doesn't cause OOM, and for each page that gets passed in, I return the results lazily using yield, and a consuming function then streams these things back to where it needs to go.
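A minimal sketch of that pattern, for illustration only. The names `stream_markdown`, `extract_page`, and `convert_page` are hypothetical and not part of pypdfium2, docling, or pdf2markdown4llm; the two callables stand in for "split out one page" and "convert that page to markdown".

```python
from typing import Callable, Iterator, Tuple


def stream_markdown(
    page_count: int,
    extract_page: Callable[[int], bytes],   # hypothetical: returns one page as bytes
    convert_page: Callable[[bytes], str],   # hypothetical: returns markdown for that page
) -> Iterator[Tuple[int, str]]:
    """Lazily yield (page_index, markdown), one page at a time."""
    for index in range(page_count):
        # Only the current page's buffer is held at this point.
        page_bytes = extract_page(index)
        markdown = convert_page(page_bytes)
        # Drop the page buffer before handing control back to the consumer.
        del page_bytes
        yield index, markdown
```

Because this is a generator, no page is extracted or converted until the consumer asks for the next item, so peak memory is bounded by a single page rather than the whole document.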

How difficult would it be to support something like this in pdf2markdown4llm, so that the markdown for each page is streamed back to the caller as soon as that page has been analyzed, extracted, and converted?

Would the requirements of the analysis process be too rigid to support this? Thanks.
