Original PDF's #9

Open
PeterStaar-IBM opened this issue Dec 19, 2024 · 5 comments

Comments

@PeterStaar-IBM

Dear OmniDocBench team,

Great work! I had a question: would it be possible to also include the original PDF pages (not their jpg representations)?

@ouyanglinke
Collaborator

We are currently using image-based PDFs for our evaluations, which do not retain the original PDF meta information. We understand that some document parsing pipelines utilize the meta information for parsing (e.g., using PyMuPDF). However, to ensure a fair comparison with large models and other purely visual parsing tools, we have not used PDFs with meta information.

If you need image-based PDFs for evaluation, we will soon provide a code tool for converting images to PDFs.
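For readers who need this before an official tool is available, here is a minimal sketch of an image-to-PDF conversion. It is not the OmniDocBench team's tool; it assumes the third-party Pillow library, and the function name `images_to_pdf` and the `dpi` default are hypothetical choices made for illustration.

```python
from pathlib import Path

from PIL import Image  # third-party: pip install pillow


def images_to_pdf(image_paths, pdf_path, dpi=200.0):
    """Write one image-only PDF with one page per input image.

    The result has no text layer, so a parser sees pixels only.
    """
    if not image_paths:
        raise ValueError("need at least one page image")
    # Pillow's PDF writer rejects alpha channels, so normalize every page to RGB.
    pages = [Image.open(p).convert("RGB") for p in image_paths]
    first, rest = pages[0], pages[1:]
    # save_all + append_images emits a multi-page PDF in the given order.
    first.save(pdf_path, "PDF", save_all=True, append_images=rest, resolution=dpi)
    return Path(pdf_path)
```

Calling `images_to_pdf(["page_0.jpg", "page_1.jpg"], "doc.pdf")` would produce a two-page PDF; the file names here are placeholders, not paths from the dataset.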

If you require PDFs that retain the meta information, we do not currently offer this. If needed, we may consider providing a download link in the future. However, this work will take some time because many PDFs are collected from various sources, and a lot of information has been lost. Tracing the original PDFs requires time, and it might still be impossible to find the original versions of some PDFs.

Additionally, to better understand user needs, we would like to know the reasons for requiring the original PDFs if possible. This will help us improve our efforts in this work.

@PeterStaar-IBM
Author

@ouyanglinke I understand that, for consistent OCR evaluation, one should use the image-based documents. However, in many real-life cases, the PDFs themselves are provided (and not the page images).

Hence, I think it would be great to have the exact original PDF page linked to each image. I would assume this would not be very hard, since you need the PDFs anyway to create the page images, and it would make the dataset more complete. Additionally, we could then compare how OCR performs against parsing the native PDF.

@ouyanglinke
Collaborator

Thank you for your suggestion. Indeed, comparing the accuracy of the same tool on PDFs that contain meta information versus image-based PDFs would be a very interesting result. We will provide a link to the PDF dataset as soon as possible, and we will get back to you once it is available.
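As a rough illustration of such a comparison, the sketch below scores the text a tool extracts from each input type against a ground-truth string. Normalized edit distance is used here only as an illustrative metric, and the function names and sample strings are hypothetical; nothing in this thread specifies how the comparison would actually be run.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match."""
    longest = max(len(predicted), len(reference))
    return 0.0 if longest == 0 else edit_distance(predicted, reference) / longest


def compare_inputs(reference: str, native_text: str, image_text: str) -> dict:
    """Score the same tool's output on a native PDF and on an image-based PDF."""
    return {
        "native_pdf": normalized_edit_distance(native_text, reference),
        "image_pdf": normalized_edit_distance(image_text, reference),
    }


# Hypothetical outputs: the native parse is exact, the OCR misreads one character.
scores = compare_inputs("hello world", "hello world", "hel1o world")
```

A lower score is better, so a gap between the two entries would indicate how much the tool gains from the embedded text layer.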

@PeterStaar-IBM
Author

@ouyanglinke That would be awesome!

@ouyanglinke
Collaborator

Good news! We have uploaded the original PDFs to Hugging Face. Be aware that the original PDFs contain no masks, so they will differ from the image-based PDFs used for evaluation. See the dataset README for details.
