Color Information for Paragraphs #25

Closed
kreuzberger opened this issue Jan 12, 2024 · 11 comments · Fixed by #30

Comments

@kreuzberger
Contributor

kreuzberger commented Jan 12, 2024

For deeper paragraph analysis I want to get the paragraph's background color, e.g. to check how a code example is rendered.
At which point in textbox.py could I get color information from the PDF object "behind the scenes", or is that too late?
Where would be a good place to get this information?

@ubmarco
Member

ubmarco commented Jan 12, 2024

I guess this will be difficult given the PDF standard in general.
There are single letters with a certain style (font, size, also color), and there are graphics such as rectangles and lines.
A colored background box is not coupled to the characters.
The only solution I see is finding paragraphs and their coordinates and matching them with graphics/figures in the same area.
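
The matching idea could be sketched roughly like this. It is only an illustration in plain Python: the `Box` type, `overlap_ratio`, `background_for` and the 0.9 threshold are hypothetical names and values, not part of libpdf, pdfminer or pdfplumber.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (hypothetical helper type)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Degenerate or inverted boxes count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(text: Box, graphic: Box) -> float:
    """Fraction of the text box area that lies inside the graphic box."""
    if text.area == 0:
        return 0.0
    intersection = Box(
        max(text.x0, graphic.x0),
        max(text.y0, graphic.y0),
        min(text.x1, graphic.x1),
        min(text.y1, graphic.y1),
    )
    return intersection.area / text.area


def background_for(paragraph: Box, graphics: list, threshold: float = 0.9):
    """Return the first graphic covering at least `threshold` of the paragraph, else None."""
    for graphic in graphics:
        if overlap_ratio(paragraph, graphic) >= threshold:
            return graphic
    return None
```

A paragraph whose box lies almost entirely inside a filled rectangle would then be treated as having that rectangle as its background. The threshold is a tuning knob, since text boxes and graphic boxes rarely line up exactly.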

@kreuzberger
Contributor Author

kreuzberger commented Jan 15, 2024

I can follow that, but...

  • The code block in the visual_debug view has no background color or background box.
  • No image is extracted into the figures directory for it, while "real" embedded images are.
  • Other "background boxes", e.g. the green boxes around "important" directives in the PDF generated from Sphinx, are also not identified as "images" or anything else.

Which tool on an Ubuntu/Linux platform would be best for analyzing the generated PDF? Currently my PDFs are generated from HTML with weasyprint (https://github.com/kozea/weasyprint), so maybe I will look for configuration possibilities there too.

Storing the PDF uncompressed does not really help me with debugging.

@ubmarco
Member

ubmarco commented Jan 15, 2024

I also want to know how background rectangles/figures are represented. However, time is a scarce resource at the moment...
In the meantime: if libpdf does not provide the necessary information, you could try the latest versions of the underlying libraries pdfminer and pdfplumber.
I will update you once I find the time to look into this.

@ubmarco
Member

ubmarco commented Jan 15, 2024

Can you provide a minimal example of a PDF? Are you using this?

@kreuzberger
Contributor Author

My main intention is to check PDFs generated with weasyprint and sphinx-simplepdf, see:
useblocks/sphinx-simplepdf#83

Currently I am using the patched version I provided as a pull request, together with the current versions of pdfminer and pdfplumber. I am testing with my project documents to evaluate WHAT could be tested and WHAT makes sense to test.

I will provide a simple example containing the code example, but the question is where to put it: in the sphinx-simplepdf project or here. Both make sense. As maintainer, it is up to you to decide.

@kreuzberger
Contributor Author

kreuzberger commented Jan 16, 2024

I think one reason could also be that the patched version I use has some failing tests. I just ran the tests today.

platform linux -- Python 3.11.2, pytest-7.4.4, pluggy-1.3.0
rootdir: /src/github/libpdf
configfile: tox.ini
plugins: bdd-7.0.1
collected 22 items                                                                                                                                                                         

tests/test_api.py ...                                                                                                                                                                [ 13%]
tests/test_catalog.py .F.                                                                                                                                                            [ 27%]
tests/test_cli.py ..                                                                                                                                                                 [ 36%]
tests/test_details.py .                                                                                                                                                              [ 40%]
tests/test_ds93_chapter.py .                                                                                                                                                         [ 45%]
tests/test_figures.py FFF                                                                                                                                                            [ 59%]
tests/test_full_features.py .......                                                                                                                                                  [ 90%]
tests/test_import.py .                                                                                                                                                               [ 95%]
tests/test_tables.py .   

The test failures caused by the ValueError in the catalog extraction seem to be identical to the errors I get with my PDF. I will investigate what the problem could be. It may be an error in one of the libraries, or something else. I will check it on the forked branch.

The failures in the figure tests could also be the reason why the box is not identified correctly. When processing my PDF with no_annotations I got no errors, but without that option I got the same ValueErrors as in test_catalog.py.

@kreuzberger
Contributor Author

I fixed the failing test_catalog.py tests by checking for valid bboxes before processing continues.

The figure tests are mysterious to me.

The first test checks that only figures with valid bboxes are extracted, but the (one) figure in the PDF has, judging by the test file name, an invalid bbox. Its height is 0, so the figure is (correctly) filtered out of the figure list.

So maybe check whether the tests themselves are valid.
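
For illustration, a bbox validity check of this kind might look like the sketch below. It is not the code from the PR: `is_valid_bbox`, `keep_valid` and the `"bbox"` key are hypothetical, and "valid" is assumed here to mean strictly positive width and height.

```python
def is_valid_bbox(x0, y0, x1, y1):
    """Assumed rule: a box is valid only if width and height are both positive."""
    return x1 > x0 and y1 > y0


def keep_valid(items):
    """Drop items whose (hypothetical) 'bbox' entry fails the validity check."""
    return [item for item in items if is_valid_bbox(*item["bbox"])]
```

Under this rule a figure with height 0 is dropped, which matches the filtering behaviour described above.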

@kreuzberger
Contributor Author

After analyzing the pages with pdfplumber I see rect objects like this:

"rects": [
        {
          "x0": 56.25,
          "y0": 426.04943199999997,
          "x1": 539.02559025,
          "y1": 593.1397637499999,
          "width": 482.77559025000005,
          "height": 167.09033174999996,
          "pts": [
            [
              56.25,
              248.75000025000008
            ],
            [
              539.02559025,
              248.75000025000008
            ],
            [
              539.02559025,
              415.84033200000005
            ],
            [
              56.25,
              415.84033200000005
            ]
          ],
          "linewidth": 0,
          "stroke": 0,
          "fill": 1,
          "evenodd": 0,
          "stroking_color": null,
          "non_stroking_color": [
            0.858824,
            0.980392,
            0.956863
          ],
          "mcid": null,
          "tag": null,
          "object_type": "rect",
          "page_number": 4,
          "stroking_pattern": null,
          "non_stroking_pattern": null,
          "top": 248.75000025000008,
          "bottom": 415.84033200000005,
          "doctop": 248.75000025000008
        },

The non-stroking color seems to match exactly the RGB value defined for this background. So I will try to find out where to get those rect objects.
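
To compare such a rect with the color defined in the stylesheet, the 0..1 components can be converted to a hex value. The sketch below works on plain dicts shaped like the dump above. The helper names `rgb_to_hex` and `filled_rects` are made up for this example, and only three-component colors are assumed to be RGB.

```python
def rgb_to_hex(color):
    """Convert an RGB triple with components in 0..1 to a '#rrggbb' string."""
    r, g, b = (round(component * 255) for component in color)
    return f"#{r:02x}{g:02x}{b:02x}"


def filled_rects(rects):
    """Return (x0, top, x1, bottom, hex_color) for every filled rect with an RGB color."""
    result = []
    for rect in rects:
        color = rect.get("non_stroking_color")
        # Skip unfilled rects and colors that are not a three-component sequence.
        if not rect.get("fill") or not isinstance(color, (list, tuple)) or len(color) != 3:
            continue
        result.append((rect["x0"], rect["top"], rect["x1"], rect["bottom"], rgb_to_hex(color)))
    return result
```

For the rect shown above, `rgb_to_hex([0.858824, 0.980392, 0.956863])` gives `#dbfaf4`, which can be compared directly with the CSS background color.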

@kreuzberger
Contributor Author

kreuzberger commented Jan 17, 2024

I integrated a solution that extracts the "rects" from pdfplumber as a separate type in libpdf, like figures, tables etc.
That means the content of the "rects" is also removed from chapters/paragraphs, as it is for tables and figures.
Could this be a solution, or should the "rects" from pdfplumber be mapped to the "figures" of libpdf?
Both solutions make sense, but I think I would prefer mapping the pdfplumber type to a libpdf type "rect".

Text extraction also needs to be clarified. Figure text is removed from chapters/paragraphs. I think for rects we should

  • leave the text in the paragraph
  • add the "rects" in addition to, and separate from, the text.

That way I could search for text in chapters/paragraphs (without any coloring / text information), or I could search for "highlighted" text parts.

What do you think?

@kreuzberger
Contributor Author

kreuzberger commented Jan 17, 2024

Added the rects extraction as arguments to main, mapping rect objects to their "own" type.
no_rects: extraction of rects is disabled (default False)
crop_rects_text: rects text is removed from chapters/paragraphs if True, else duplicated in rects/paragraph (default False)
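
The effect of crop_rects_text can be pictured with a toy model. This is not libpdf's implementation: `split_rect_text` and its arguments are hypothetical and only mirror the behaviour described in the option text.

```python
def split_rect_text(lines, rect_line_indices, crop_rects_text=False):
    """Toy model of the crop_rects_text option; not libpdf's implementation.

    lines: paragraph text lines in reading order.
    rect_line_indices: indices of the lines that lie inside a rect.
    Returns (paragraph_lines, rect_lines).
    """
    rect_lines = [line for i, line in enumerate(lines) if i in rect_line_indices]
    if crop_rects_text:
        # Text inside a rect is removed from the paragraph.
        paragraph_lines = [line for i, line in enumerate(lines) if i not in rect_line_indices]
    else:
        # Text inside a rect stays in the paragraph and is duplicated in the rect.
        paragraph_lines = list(lines)
    return paragraph_lines, rect_lines
```

With the default, a search over paragraphs still finds the highlighted text, and a search over rects finds only the highlighted parts.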

All tests "run" except the test_figures tests. My assumption is that the old pdfplumber handled rects and figures differently, or that the "rects" were mapped differently to "figures".

All changes are pushed to PR #24 and got merged into #30.

@juiwenchen
Contributor

Rect is introduced in the following PR. Credit to @kreuzberger

#30

@juiwenchen juiwenchen mentioned this issue Jan 23, 2024
@juiwenchen juiwenchen linked a pull request Jan 23, 2024 that will close this issue