Original PDF's #9

PeterStaar-IBM · 2024-12-19T16:16:34Z

Dear OmniDocBench team,

Great work! I had a question: would it be possible to also include the original PDF pages (not their jpg representations)?

ouyanglinke · 2024-12-20T03:43:08Z

We are currently using image-based PDFs for our evaluations, which do not retain the original PDF meta information. We understand that some document parsing pipelines utilize the meta information for parsing (e.g., using PyMuPDF). However, to ensure a fair comparison with large models and other purely visual parsing tools, we have not used PDFs with meta information.

If you need image-based PDFs for evaluation, we will soon provide a code tool for converting images to PDFs.
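Until that tool is available, here is a minimal sketch of one way to wrap a single JPEG page in an image-only PDF using only the Python standard library. It is not the OmniDocBench conversion tool. The function names `jpeg_info` and `jpeg_to_pdf` are hypothetical, and the sketch assumes 8-bit grayscale or RGB JPEGs and a page size of one PDF point per pixel. The JPEG bytes are embedded unchanged through the PDF `/DCTDecode` filter, so the page carries no text layer.

```python
import struct

# Start-of-frame markers: 0xC0-0xCF except DHT (0xC4), JPG (0xC8), DAC (0xCC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}
COLOR_SPACES = {1: "/DeviceGray", 3: "/DeviceRGB"}


def jpeg_info(data: bytes) -> tuple[int, int, int]:
    """Return (width, height, components) read from the JPEG frame header.

    Simplification: assumes every segment before the frame header carries
    a length field, which holds for typical baseline/progressive files.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    i = 2
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            raise ValueError("malformed JPEG: expected a marker")
        marker = data[i + 1]
        if marker == 0xFF:  # padding byte before the real marker
            i += 1
            continue
        (seg_len,) = struct.unpack(">H", data[i + 2:i + 4])
        if marker in SOF_MARKERS:
            _bits, height, width, ncomp = struct.unpack(">BHHB", data[i + 4:i + 10])
            return width, height, ncomp
        i += 2 + seg_len  # skip marker (2 bytes) plus segment
    raise ValueError("no start-of-frame segment found")


def jpeg_to_pdf(jpeg: bytes) -> bytes:
    """Build a one-page PDF whose only content is the given JPEG."""
    width, height, ncomp = jpeg_info(jpeg)
    if ncomp not in COLOR_SPACES:
        raise ValueError(f"unsupported component count: {ncomp}")
    # Assumption: 1 pixel = 1 point, so the page is width x height points.
    content = f"q {width} 0 0 {height} 0 0 cm /Im0 Do Q".encode()
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
         f"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>").encode(),
        (f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
         f"/ColorSpace {COLOR_SPACES[ncomp]} /BitsPerComponent 8 "
         f"/Filter /DCTDecode /Length {len(jpeg)} >>\nstream\n").encode()
        + jpeg + b"\nendstream",
        f"<< /Length {len(content)} >>\nstream\n".encode() + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += f"{num} 0 obj\n".encode() + body + b"\nendobj\n"
    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode()
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += f"{off:010d} 00000 n \n".encode()
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_pos}\n%%EOF\n").encode()
    return bytes(out)
```

Reading a page image as bytes, passing it to `jpeg_to_pdf`, and writing the returned bytes to a `.pdf` file gives a PDF that a parser can only handle visually, which matches the image-based setting described above.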

If you require PDFs that retain the meta information, we do not currently offer this. If needed, we may consider providing a download link in the future. However, this work will take some time because many PDFs are collected from various sources, and a lot of information has been lost. Tracing the original PDFs requires time, and it might still be impossible to find the original versions of some PDFs.

Additionally, to better understand user needs, we would like to know the reasons for requiring the original PDFs if possible. This will help us improve our efforts in this work.

PeterStaar-IBM · 2024-12-20T11:38:49Z

@ouyanglinke I understand that, for consistent OCR evaluation, one should use the image-based documents. However, in many real-life cases, the PDFs are provided (and not the page images).

Hence, I think it would be great to have the exact original PDF page linked to the image. I would assume that this would not be very hard, since you need them anyway to create the page image, and it would make the dataset more complete. Additionally, we could compare how OCR performs against parsing the native PDF.

ouyanglinke · 2024-12-20T12:33:43Z

Thank you for your suggestion. Indeed, comparing the accuracy of the same tool on PDFs containing meta information and on image-based PDFs would be a very interesting result. We will provide the link to the PDF dataset as soon as possible, and we will get back to you at that time.

PeterStaar-IBM · 2024-12-20T13:21:02Z

@ouyanglinke That would be awesome!

ouyanglinke · 2024-12-25T10:56:59Z

Good news! We have uploaded the original PDFs to Hugging Face. Be aware that the original PDFs contain no masks, so they will differ from the image-based PDFs used for evaluation. For details, see the README of the dataset.
