Color Information for Paragraphs #25

kreuzberger · 2024-01-12T08:30:03Z

For a deeper paragraph analysis I want to get information about the paragraph background color, e.g. to check for a rendered code example.
At which point in textbox.py could I get color information from the PDFObj "behind the scenes", or is that too late?
Where would be a good place to get this information?

ubmarco · 2024-01-12T21:50:53Z

I guess this will be difficult with the PDF standard in general.
There are single letters of a certain style (font, size, also color). And there are graphics such as rectangles and lines.
A colored background box is not coupled to the characters.
The only solution I see is finding paragraphs and their coordinates and matching them with graphics/figures in the same area.
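
A minimal sketch of that matching idea, assuming paragraphs and graphics are both available as bounding boxes in the same coordinate system. The names BBox, overlap_ratio and backgrounds_for are hypothetical and not part of libpdf, pdfminer or pdfplumber, and the 0.9 threshold is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box; all boxes must use the same coordinate system."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Degenerate or inverted boxes count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(text: BBox, graphic: BBox) -> float:
    """Fraction of the text box area that is covered by the graphic box."""
    if text.area == 0.0:
        return 0.0
    intersection = BBox(
        max(text.x0, graphic.x0),
        max(text.y0, graphic.y0),
        min(text.x1, graphic.x1),
        min(text.y1, graphic.y1),
    )
    return intersection.area / text.area


def backgrounds_for(text: BBox, graphics: list[BBox], threshold: float = 0.9) -> list[BBox]:
    """Return the graphics that cover at least `threshold` of the text box."""
    return [g for g in graphics if overlap_ratio(text, g) >= threshold]
```

A paragraph would then be treated as "having a background" when backgrounds_for returns at least one graphic for its box. The color itself would still have to come from the matched graphic.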

kreuzberger · 2024-01-15T09:42:36Z

I can follow that, but...

The code block itself has no background color or background box in the visual_debug view.
No image is extracted to the figures directory for it, while "real" embedded images are.
Other "background boxes", e.g. the green boxes around "important" directives in the PDF (which is generated from Sphinx), are also not identified as images or anything else.

Which tool on an Ubuntu / Linux platform would be best for analyzing the generated PDF? Currently my PDFs are generated from HTML with WeasyPrint (https://github.com/kozea/weasyprint); maybe I will look for configuration possibilities there too.

Storing the PDF uncompressed does not really help me with debugging either.

ubmarco · 2024-01-15T21:40:09Z

I would also like to know how background rectangles/figures are represented. However, time is a scarce resource at the moment...
In the meantime: If libpdf does not provide the necessary information, you could try the latest version of the underlying libraries pdfminer and pdfplumber.
I will update you once I find the time to look into this.

ubmarco · 2024-01-15T21:44:14Z

Can you provide a minimal example of a PDF? Are you using this?

kreuzberger · 2024-01-16T06:26:11Z

My main intention is to check PDFs generated with WeasyPrint and sphinx-simplepdf, see:
useblocks/sphinx-simplepdf#83

Currently I am using the patched version I provided as a pull request, with the current versions of pdfminer and pdfplumber. I am testing with my project documents to evaluate WHAT could be tested and WHAT makes sense to test.

I will provide a simple example containing a code block, but the question is where to put it: in the sphinx-simplepdf project or here. Both make sense. As maintainer, it is up to you to decide.

kreuzberger · 2024-01-16T06:34:29Z

I think one reason could also be that the patched version I use has some failing tests. I just ran the tests today.

platform linux -- Python 3.11.2, pytest-7.4.4, pluggy-1.3.0
rootdir: /src/github/libpdf
configfile: tox.ini
plugins: bdd-7.0.1
collected 22 items                                                                                                                                                                         

tests/test_api.py ...                                                                                                                                                                [ 13%]
tests/test_catalog.py .F.                                                                                                                                                            [ 27%]
tests/test_cli.py ..                                                                                                                                                                 [ 36%]
tests/test_details.py .                                                                                                                                                              [ 40%]
tests/test_ds93_chapter.py .                                                                                                                                                         [ 45%]
tests/test_figures.py FFF                                                                                                                                                            [ 59%]
tests/test_full_features.py .......                                                                                                                                                  [ 90%]
tests/test_import.py .                                                                                                                                                               [ 95%]
tests/test_tables.py .

The test failures caused by the ValueError in the catalog extraction seem to be identical to the errors I get with my PDF. I will investigate what the problem could be. It may be an error in one of the libraries, or it may be something else. I will check it on the forked branch.

The failures in the figure tests could also be the reason why the box is not identified correctly. Processing my PDF with no_annotations gave no errors, but with annotations I got the same ValueErrors as in test_catalog.py.

kreuzberger · 2024-01-16T14:56:09Z

Fixed the failing tests in test_catalog.py by checking for valid bboxes before processing continues.
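
The actual check in the patch is not shown in this thread. As a rough illustration only, a bbox validity check could look like the following, assuming a bbox is given as (x0, y0, x1, y1) and counts as valid when all coordinates are finite and both width and height are positive. is_valid_bbox is a hypothetical helper, not a libpdf function.

```python
import math


def is_valid_bbox(x0, y0, x1, y1):
    """Return True if the box has finite coordinates and positive width and height."""
    coords = (x0, y0, x1, y1)
    # Reject None, strings, NaN and infinities before comparing.
    if not all(isinstance(c, (int, float)) and math.isfinite(c) for c in coords):
        return False
    return x1 > x0 and y1 > y0
```

With such a check, a box whose height is 0 would be skipped instead of raising an error further down in the processing.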

The figure tests are a mystery to me.

The first test checks that only figures with valid bboxes are extracted, but the (one) figure in the PDF has (judging by the test file name) an invalid bbox. Its height is 0, so the figure is (correctly) filtered out of the figure list.

So maybe check whether the tests themselves are valid.

kreuzberger · 2024-01-16T16:00:08Z

After analyzing the pages with pdfplumber I see rect objects like this:

"rects": [
        {
          "x0": 56.25,
          "y0": 426.04943199999997,
          "x1": 539.02559025,
          "y1": 593.1397637499999,
          "width": 482.77559025000005,
          "height": 167.09033174999996,
          "pts": [
            [
              56.25,
              248.75000025000008
            ],
            [
              539.02559025,
              248.75000025000008
            ],
            [
              539.02559025,
              415.84033200000005
            ],
            [
              56.25,
              415.84033200000005
            ]
          ],
          "linewidth": 0,
          "stroke": 0,
          "fill": 1,
          "evenodd": 0,
          "stroking_color": null,
          "non_stroking_color": [
            0.858824,
            0.980392,
            0.956863
          ],
          "mcid": null,
          "tag": null,
          "object_type": "rect",
          "page_number": 4,
          "stroking_pattern": null,
          "non_stroking_pattern": null,
          "top": 248.75000025000008,
          "bottom": 415.84033200000005,
          "doctop": 248.75000025000008
        },

The non-stroking color seems to match exactly the RGB value defined for this background. So I will try to find out where to get those rect objects.
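
To compare such a color with a value from a stylesheet, the 0..1 components can be scaled to 8-bit RGB. The sketch below uses only the numbers from the rect shown above. to_rgb8 and to_hex are hypothetical helper names, and it assumes the color is a 3-component RGB value (gray or CMYK colors would have a different number of components).

```python
def to_rgb8(color):
    """Convert a 3-component color with values in 0..1 to 8-bit RGB."""
    if color is None or len(color) != 3:
        raise ValueError("expected a 3-component RGB color")
    return tuple(round(c * 255) for c in color)


def to_hex(color):
    """Format a 0..1 RGB color as a CSS-style hex string."""
    return "#{:02x}{:02x}{:02x}".format(*to_rgb8(color))


rect = {  # values copied from the rect object above
    "y0": 426.04943199999997,
    "y1": 593.1397637499999,
    "top": 248.75000025000008,
    "bottom": 415.84033200000005,
    "non_stroking_color": [0.858824, 0.980392, 0.956863],
}

print(to_rgb8(rect["non_stroking_color"]))  # (219, 250, 244)
print(to_hex(rect["non_stroking_color"]))   # #dbfaf4

# "top"/"bottom" are measured from the top of the page, "y0"/"y1" from the
# bottom, so both sums give the same page height (about 841.89 pt here).
page_height = rect["top"] + rect["y1"]
```

In pdfplumber these dictionaries are available per page through the rects property of a page object.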

kreuzberger · 2024-01-17T08:28:36Z

I integrated a solution that extracts the "rects" from pdfplumber as a separate type in libpdf, like figures, tables etc.
That means that the content of the "rects" is also removed from chapters/paragraphs, as is done for tables and figures.
Could this be a solution, or should the "rects" from pdfplumber be mapped to the "figures" of libpdf?
Both solutions make sense, but I think I would prefer mapping the pdfplumber type to a libpdf type "rect".

Text extraction also needs to be clarified. Figure text is removed from chapters/paragraphs. I think for rects we should

leave the text in the paragraph
add the "rects" additionally / separately from the text.

That way I could search for text in chapters/paragraphs (without any coloring / text information), or I could search for "highlighted" text parts.

What do you think?

kreuzberger · 2024-01-17T09:26:27Z

Added the rects extraction as arguments to main, mapping rect objects to their own type.
no_rects: extraction of rects is disabled (default False)
crop_rects_text: rects text is removed from chapters/paragraphs if true, else duplicated in rects/paragraph (default False)
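
The effect of the two flags on text can be pictured with a small stand-alone sketch. This is not the libpdf implementation: distribute and inside are hypothetical helpers, text lines are assumed to be (text, bbox) pairs with bbox as (x0, y0, x1, y1), and a line is assumed to belong to a rect when its center point lies inside it.

```python
def inside(bbox, container):
    """True if the center point of bbox lies within container."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    return container[0] <= cx <= container[2] and container[1] <= cy <= container[3]


def distribute(lines, rects, no_rects=False, crop_rects_text=False):
    """Split text lines into (paragraph_text, rect_text) according to the flags."""
    if no_rects:
        # Rect extraction disabled: everything stays in the paragraphs.
        return [text for text, _ in lines], []
    paragraph_text, rect_text = [], []
    for text, bbox in lines:
        in_rect = any(inside(bbox, r) for r in rects)
        if in_rect:
            rect_text.append(text)
        # With crop_rects_text the rect text is removed from the paragraphs,
        # otherwise it is kept there as well (duplicated).
        if not (in_rect and crop_rects_text):
            paragraph_text.append(text)
    return paragraph_text, rect_text
```

With the defaults, text inside a rect shows up both in the paragraph and in the rect, which allows searching plain text as well as "highlighted" text.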

All tests pass when the test_figures tests are excluded. My assumption is that the old pdfplumber handled rects and figures differently, or that the "rects" were mapped to "figures" differently.

All changes are pushed to PR #24 and were merged into #30.

juiwenchen · 2024-01-23T13:54:34Z

Rect is introduced in the following PR. Credit to @kreuzberger.

#30

juiwenchen closed this as completed Jan 23, 2024

juiwenchen mentioned this issue Jan 23, 2024

Rect model #30

Merged

juiwenchen linked a pull request Jan 23, 2024 that will close this issue

Rect model #30

Merged
