Why is the OCR data not consistent with the paper? #11

Road2Redemption · 2025-02-02T10:28:45Z

The paper states that there are 50K web_cc OCR samples for each language, so there should be 500K samples in total, but the released version of PangeaInstruct has only 300K. Is there a size mismatch, or is some of the data being kept private?
Thank you!

yueqis · 2025-02-02T16:57:11Z

Hi, thanks for pointing this out! We only used 300k OCR samples in training, so the previously released data contained only 300k OCR samples. We have now uploaded the 500k data to Hugging Face; please check: https://huggingface.co/datasets/neulab/PangeaInstruct/blob/main/ocr/webui_multilingual_ocr/data-500k.json
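
For anyone who wants to check the count themselves, here is a minimal sketch in Python. The repository ID and file path are taken from the link above. Everything else is an assumption: the file is assumed to be a single top-level JSON list of records, and the per-language field name `language` is hypothetical, so adjust `language_key` to whatever the file actually uses. The download helper relies on `hf_hub_download` from the `huggingface_hub` package.

```python
import json
from collections import Counter
from pathlib import Path

REPO_ID = "neulab/PangeaInstruct"
FILENAME = "ocr/webui_multilingual_ocr/data-500k.json"


def download_ocr_file() -> Path:
    """Download the OCR file from the Hugging Face Hub and return its local path."""
    # Imported here so the counting helper below works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset"))


def count_records(path, language_key="language"):
    """Return (total, per_language) for a JSON file holding a list of records.

    ASSUMPTION: the file is one top-level JSON list, and each record may carry
    a `language_key` field. Records without that field are counted as "unknown".
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list of records")
    per_language = Counter(
        r.get(language_key, "unknown") if isinstance(r, dict) else "unknown"
        for r in records
    )
    return len(records), per_language


# Example usage (requires network access and a large download):
#   total, per_language = count_records(download_ocr_file())
#   print(total)
#   print(per_language.most_common())
```

If the total comes out near 500K and each language sits near 50K, the upload matches the figures quoted from the paper.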

Road2Redemption · 2025-02-04T09:10:20Z

Thank you very much!

yueqis closed this as completed Feb 2, 2025
