Color Information for Paragraphs #25
Comments
I guess this will be difficult with the PDF standard in general. |
Could follow but,.....
Which tool on an Ubuntu/Linux platform would be best for analyzing the generated PDF? Currently my PDFs are generated from HTML with WeasyPrint (https://github.com/kozea/weasyprint), so maybe I will look for configuration possibilities there too. Storing the PDFs uncompressed does not really help me with debugging. |
I also want to know how background rectangles/figures are represented. However time is a scarce resource currently... |
Can you provide a minimal example of a PDF? Are you using this? |
My main intention is to check PDFs generated with WeasyPrint and sphinx-simplepdf from here: Currently I am using the patched version I provided as a pull request, with the "current" libraries from pdfminer and pdfplumber. I am testing with my project documents to evaluate WHAT could be tested and WHAT makes sense to test. I will provide a simple example together with the code example, but the question is where to provide it: in the sphinx-simplepdf project or here. Both make sense. As maintainer, it is up to you to decide. |
I think one reason could also be that the patched version I use has some errors in testing. I just executed the tests today.
The test errors due to the ValueError in the catalog extraction seem to be identical to the errors with my PDF. I will investigate what the problem could be. This may also be an error in one of the libraries, or maybe something else. I will check it on the forked branch. The error in the figures could also be the reason why the box is not identified correctly. During processing of my pdf with |
Fixed the failing test_catalog.py tests by checking for valid bboxes before continuing processing. The figure tests are mysterious to me. The first test wants to check that only figures with valid bboxes are extracted, but the (one) figure in the PDF has (judging by the test file name) an invalid bbox. So the height is 0 and therefore the figure is (correctly) filtered out of the figure list. So maybe check whether the tests themselves are valid. |
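A minimal sketch of the kind of bbox validity check described above. It is not the actual libpdf code: the `(x0, y0, x1, y1)` tuple layout and the names `is_valid_bbox` and `keep_valid` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Assumed layout: (x0, y0, x1, y1). libpdf's real bbox type may differ.
BBox = Tuple[float, float, float, float]


def is_valid_bbox(bbox: BBox) -> bool:
    """Return True only if the box has a positive width and a positive height."""
    x0, y0, x1, y1 = bbox
    return (x1 - x0) > 0 and (y1 - y0) > 0


def keep_valid(bboxes: Iterable[BBox]) -> List[BBox]:
    """Drop degenerate boxes (zero or negative extent) before further processing."""
    return [bbox for bbox in bboxes if is_valid_bbox(bbox)]
```

With this check a figure whose height is 0 is filtered out, which matches the behaviour observed in the figure test.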
After analyzing the pages with pdfplumber I see rect objects like this:
The non-stroking color seems to match exactly the RGB value defined for this background. So I will try to find out where to get those rect objects. |
I integrated a solution that extracts the "rects" from pdfplumber as a separate type in libpdf, like figures, tables etc. Text extraction also needs to be clarified. Figure text is removed from chapters/paragraphs. I think for rects we should
So I could search for text in chapters/paragraphs (without any coloring / text information), or I could search for "highlighted" text parts. What do you think? |
Added the rects extraction as arguments to main, mapping rect objects as their "own" type. All tests run except the test_figures tests. My assumption is that the old pdfplumber handled rects and figures differently, or that the "rects" were mapped to "figures" differently. |
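The rect output from the comment above is not preserved here. As a rough sketch of how the fill colors can be listed: pdfplumber exposes `page.rects` as a list of dicts, each with a `non_stroking_color` entry. The helper names below are hypothetical, and the code assumes the color components are plain numbers in the range 0..1 (depending on the pdfplumber version the value may be a tuple, a list, a single number, or `None`).

```python
from collections import Counter


def normalize_color(color):
    """Turn a pdfplumber color value into a tuple of floats, or None."""
    if color is None:
        return None
    if isinstance(color, (int, float)):
        return (float(color),)  # single component, e.g. grayscale
    return tuple(float(c) for c in color)


def to_rgb255(color):
    """Convert a 3-component 0..1 color to 0..255 integers, else None."""
    normalized = normalize_color(color)
    if normalized is None or len(normalized) != 3:
        return None
    return tuple(round(value * 255) for value in normalized)


def count_fill_colors(rects):
    """Count how often each fill (non-stroking) color occurs in rect dicts."""
    return Counter(normalize_color(r.get("non_stroking_color")) for r in rects)


def rects_of_pdf(path):
    """Yield all rect dicts of a PDF; pdfplumber is imported only when called."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            yield from page.rects
```

Calling `count_fill_colors(rects_of_pdf("example.pdf"))` (the file name is a placeholder) gives a quick overview of which background colors occur in a document.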
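One possible way to search for "highlighted" text, sketched under assumptions: words come from pdfplumber's `page.extract_words()` and rects from `page.rects`, both as dicts with `x0`, `top`, `x1` and `bottom` keys. The function names and the tolerance value are hypothetical, and this is not how libpdf implements it.

```python
def word_in_rect(word, rect, tol=1.0):
    """True if the word's box lies inside the rect, allowing a small tolerance."""
    return (
        word["x0"] >= rect["x0"] - tol
        and word["x1"] <= rect["x1"] + tol
        and word["top"] >= rect["top"] - tol
        and word["bottom"] <= rect["bottom"] + tol
    )


def same_color(a, b, eps=1e-3):
    """Compare two color sequences component-wise with a small epsilon."""
    if a is None or b is None or len(a) != len(b):
        return False
    return all(abs(x - y) <= eps for x, y in zip(a, b))


def highlighted_text(words, rects, color):
    """Join the text of all words lying on a rect filled with the given color."""
    matching = [r for r in rects if same_color(r.get("non_stroking_color"), color)]
    hits = [w["text"] for w in words if any(word_in_rect(w, r) for r in matching)]
    return " ".join(hits)
```

The plain chapter/paragraph search stays untouched; this only adds a second lookup that is restricted to text on a given background color.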
Rect is introduced in the following PR. Credit to @kreuzberger |
Through deeper paragraph analysis I want to get information about the paragraph background color, e.g. to check the rendered code example.
At which point in textbox.py could I get color information from the PDFObj "behind the scenes", or is this too late?
Where would be a good point to get this information?