Incorrect block detection #4243

mxav1111 · 2025-01-20T03:34:04Z

Description of the bug

Block detection appears to return wrong boundaries in certain cases.

I am trying to extract the blocks themselves (their rectangles, not the text), and the block boundary coordinates are not being detected correctly.

pip list -o confirms that the installed version is the latest.

Attached are the source PDF, the output PDF, and the code (as a .txt file).

output_check_last_page_for_issue.pdf
source.pdf
block_detection_issue_on_last_page.py.txt

Thanks for your help.
-M

How to reproduce the bug

To reproduce this issue, please see the attached files: the source PDF, the output PDF, and the code.

To run the code, rename the .txt file to block_detection_issue_on_last_page.py and run:

$ python block_detection_issue_on_last_page.py source.pdf output_new.pdf

PyMuPDF version

1.25.2

Operating system

Linux

Python version

3.12

JorjMcKie · 2025-01-20T09:12:16Z

Maintainer

Please be specific: against which block on which page do you have objections?
There seems to be a misconception WRT how blocks are determined:

- There is no guarantee that text block rectangles are disjoint.
- It may even happen that text blocks are contained within each other (which is the case here).
- Text blocks do not follow the reading sequence.

If you are looking for text in the right (reading) sequence, do not rely on blocks, but go down at least one hierarchy level, i.e. to lines or even text spans.
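A minimal sketch of how these points could be checked on a page, assuming PyMuPDF is importable as pymupdf (true for recent releases such as 1.25.x). The helper names contains, nested_pairs, line_boxes and report are hypothetical and not part of PyMuPDF; only page.get_text("blocks") and page.get_text("dict") are library calls.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies entirely inside rectangle `outer`.

    Rectangles are (x0, y0, x1, y1) tuples in page coordinates.
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def nested_pairs(bboxes):
    """Return (outer_index, inner_index) for every rectangle inside another."""
    return [
        (i, j)
        for i, a in enumerate(bboxes)
        for j, b in enumerate(bboxes)
        if i != j and contains(a, b)
    ]


def line_boxes(page):
    """Collect (bbox, text) for every text line, one level below blocks."""
    result = []
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # type 0 is text; skip image blocks
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            result.append((tuple(line["bbox"]), text))
    return result


def report(pdf_path, page_number=-1):
    """Print nested text blocks and all line rectangles of one page."""
    import pymupdf  # imported here so the helpers above work without it

    doc = pymupdf.open(pdf_path)
    page = doc[page_number]  # -1 selects the last page
    # Each entry is (x0, y0, x1, y1, text, block_no, block_type).
    text_blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    bboxes = [tuple(b[:4]) for b in text_blocks]
    for outer, inner in nested_pairs(bboxes):
        print(f"block {inner} {bboxes[inner]} lies inside block {outer}")
    for bbox, text in line_boxes(page):
        print(bbox, repr(text))
```

Calling report("source.pdf") would list any text blocks that lie inside another block on the last page, followed by the rectangle and text of each line.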

JorjMcKie · 2025-01-20T13:40:50Z

Maintainer

I am going to transfer this to Discussions. I reviewed the example file and fail to see any issue.

mxav1111 · 2025-01-25T05:12:42Z

Author

Oh yes, I am so sorry. This concerns the last page of the PDF, where there is an extra block below the two column text blocks.

Yes, the focus is on extracting blocks.

If that extra block is at the top instead, detection matches what we would expect visually: it is detected as a separate block, with one additional block for each column. That is exactly how we prefer it.

However, when that block is at the bottom, which is the case here, it gets merged with one of the column blocks.

Is it possible to apply the logic used for a block at the top to a block found at the bottom?

Thanks for your help.
-M
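One possible workaround, sketched here under a loud assumption: the bottom block is separated from the column above it by a vertical gap clearly larger than the normal line spacing, and the text runs horizontally. This is not PyMuPDF's own block logic. The helpers union, split_block_lines and resplit_blocks, as well as the gap_factor threshold, are hypothetical names introduced for illustration. The idea is to go down to the line level and split each block again wherever consecutive lines are unusually far apart.

```python
def union(rects):
    """Smallest rectangle (x0, y0, x1, y1) covering all given rectangles."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def split_block_lines(line_bboxes, gap_factor=1.5):
    """Group line rectangles into runs separated by large vertical gaps.

    A new group starts when the gap between the lowest edge seen so far
    and the top of the next line exceeds gap_factor times that line's
    height. gap_factor is an assumed tuning value, not a library setting.
    """
    lines = sorted(line_bboxes, key=lambda r: r[1])  # top to bottom
    if not lines:
        return []
    groups = [[lines[0]]]
    bottom = lines[0][3]
    for rect in lines[1:]:
        height = rect[3] - rect[1]
        if rect[1] - bottom > gap_factor * height:
            groups.append([rect])
            bottom = rect[3]
        else:
            groups[-1].append(rect)
            bottom = max(bottom, rect[3])
    return groups


def resplit_blocks(page, gap_factor=1.5):
    """Return one rectangle per gap-separated group of lines in each text block."""
    rects = []
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # type 0 is text; skip image blocks
            continue
        lines = [tuple(line["bbox"]) for line in block["lines"]]
        rects.extend(union(g) for g in split_block_lines(lines, gap_factor))
    return rects
```

With a page loaded via pymupdf.open("source.pdf")[-1], resplit_blocks(page) would return the candidate rectangles, which could then be drawn for a visual check. Whether the threshold suits this particular file has not been verified.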
