Original PDFs #9
Comments
We currently use image-based PDFs for our evaluations, which do not retain the original PDF meta information. We understand that some document parsing pipelines rely on that meta information (e.g., via PyMuPDF). However, to ensure a fair comparison with large models and other purely visual parsing tools, we have not used PDFs with meta information. If you need image-based PDFs for evaluation, we will soon provide a tool for converting the images to PDFs. We do not currently offer PDFs that retain the meta information, although we may consider providing a download link in the future. That work will take some time: the PDFs were collected from many different sources and much of the source information has been lost, so tracing the originals is slow, and for some PDFs it may be impossible to find the original version at all. To better understand user needs, we would also like to know why you need the original PDFs, if you are able to share that. It will help us prioritize this work.
@ouyanglinke I understand that, for consistent OCR evaluation, one should use the image-based documents. However, in many real-life cases the PDFs are provided, not the page images. Hence, I think it would be great to have the exact original PDF page linked to each image. I would assume this is not very hard, since you need the PDFs anyway to create the page images, and it would make the dataset more complete. Additionally, we could then compare OCR against parsing the native PDF.
Thank you for your suggestion. Indeed, comparing the accuracy of the same tool on PDFs that contain meta information and on image-based PDFs would be a very interesting result. We will provide a link to the PDF dataset as soon as possible, and we will get back to you then.
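As a rough illustration of the comparison discussed above, here is a minimal sketch of how one might score native-PDF text extraction and OCR output against the same reference text. It assumes a character-level normalized edit distance as the metric; that choice, and the names `levenshtein` and `normalized_edit_distance`, are assumptions made for this example and not necessarily what the benchmark itself uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed with a rolling DP row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string along the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical."""
    if not predicted and not reference:
        return 0.0
    return levenshtein(predicted, reference) / max(len(predicted), len(reference))


# Hypothetical strings standing in for one page's ground truth, the text
# pulled from the native PDF, and the text produced by OCR on the page image.
reference_text = "Document parsing benchmark"
native_pdf_text = "Document parsing benchmark"
ocr_text = "Docurnent parsing benchrnark"

native_score = normalized_edit_distance(native_pdf_text, reference_text)
ocr_score = normalized_edit_distance(ocr_text, reference_text)
```

Running the same parser on both inputs and comparing `native_score` with `ocr_score` per page would give the kind of side-by-side result mentioned here; lower is better for both.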
@ouyanglinke That would be awesome!
Good news! We have uploaded the original PDFs to Hugging Face. Be aware that the original PDFs contain no masks, so they will differ from the image-based PDFs used for evaluation. See the dataset README for details.
Dear OmniDocBench team,
Great work! I had a question: would it be possible to also include the original PDF pages (not their jpg representations)?