Replies: 3 comments
-
Please be specific: against which block on which page do you have objections?
If you are looking for text in the right (reading) sequence, do not rely on blocks, but go down at least one hierarchy level, i.e. lines or even text spans.
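As a minimal sketch of what "going down one hierarchy level" can look like, the helper below walks the dictionary returned by `page.get_text("dict")` and yields one bounding box and text string per line instead of per block. The function names `iter_text_lines` and `print_last_page_lines` are made up for this illustration; only `pymupdf.open` and `page.get_text("dict")` are PyMuPDF calls.

```python
def iter_text_lines(page_dict):
    """Yield (bbox, text) for every text line in a page.get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            # A line consists of one or more spans; join their text.
            text = "".join(span["text"] for span in line.get("spans", []))
            yield tuple(line["bbox"]), text


def print_last_page_lines(pdf_path):
    """Hypothetical driver: print every line of the last page with its bbox."""
    import pymupdf  # imported here so the helper above also works without PyMuPDF

    doc = pymupdf.open(pdf_path)
    page = doc[-1]  # last page, where the reported layout occurs
    for bbox, text in iter_text_lines(page.get_text("dict")):
        print(bbox, text)
```

Calling `print_last_page_lines("source.pdf")` should list the lines of the last page one by one, which makes it easier to see which lines ended up in which block.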
-
I am going to transfer this to Discussions. I reviewed the example file and fail to see any issue.
-
Oh yes, I am so sorry. This is about the last page of the PDF, where there is an extra block below the two columns of text. Sorry about that. Yes, the focus is on extracting blocks. When that extra block is at the top of the page, detection works exactly as we would visually expect: it is detected as a separate block, with an additional block for each column. That is exactly what we prefer. However, when the block is at the bottom, which is the case here, it gets merged with one of the column blocks. Is it possible to apply the same logic that is used when the block is at the top to a block found at the bottom? Thanks for your help.
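One possible workaround, sketched here under assumptions and not taken from PyMuPDF itself, is to post-process a merged block at line level: sort its lines by their top coordinate and start a new group whenever the vertical gap to the previous group exceeds a threshold. The function name `split_block_by_gap` and the default `max_gap` of 6 points are hypothetical choices for illustration; the input is a single text block from `page.get_text("dict")["blocks"]`.

```python
def split_block_by_gap(block, max_gap=6.0):
    """Split one text block into line groups separated by large vertical gaps.

    Returns a list of (x0, y0, x1, y1) bounding boxes, one per group.
    max_gap is an assumed threshold in points and needs tuning per document.
    """
    lines = sorted(block.get("lines", []), key=lambda ln: ln["bbox"][1])
    groups = []
    for line in lines:
        x0, y0, x1, y1 = line["bbox"]
        if groups and y0 - groups[-1][3] <= max_gap:
            # Close enough to the previous group: grow its bounding box.
            g = groups[-1]
            groups[-1] = [min(g[0], x0), min(g[1], y0), max(g[2], x1), max(g[3], y1)]
        else:
            # First line, or a gap larger than max_gap: start a new group.
            groups.append([x0, y0, x1, y1])
    return [tuple(g) for g in groups]
```

Whether a simple gap threshold separates the bottom block from the column above it depends on the actual spacing in the file, so the returned rectangles should be checked visually, for example by drawing them onto the output page.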
-
Description of the bug
It seems block detection has an issue in certain cases.
I am trying to extract the blocks (not the text), and the block boundary coordinates are not being detected correctly.
pip list -o confirms that the installed version is the latest.
Attached are the source PDF, the output PDF, and the code (as a .txt file):
output_check_last_page_for_issue.pdf
source.pdf
block_detection_issue_on_last_page.py.txt
Thanks for your help.
-M
How to reproduce the bug
To reproduce this issue, please see the attached files: source PDF, output PDF, and code.
To run the code, rename the .txt file to a Python file, block_detection_issue_on_last_page.py,
and run:
$ python block_detection_issue_on_last_page.py source.pdf output_new.pdf
PyMuPDF version
1.25.2
Operating system
Linux
Python version
3.12