Fix hexadecimal decoding (fixes #715) #716
There are syntax errors, please check: https://github.com/smalot/pdfparser/actions/runs/9259258917/job/25483604622?pr=716#step:6:13
@k00ni I have fixed the PHP 7.x incompatibility.
Is there a part of the PDF specification you are referring to with these changes in general?
Probably, but it was not intentional. Let it be "PDF Reference sixth edition" by Adobe Systems Incorporated (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) page 53:
(Maybe the tests should include this string instead of mine.) I was mainly interested in reading Sig objects (https://www.adobe.com/devnet-docs/acrobatetk/tools/DigSig/Acrobat_DigitalSignatures_in_PDF.pdf), but pdfparser was truncating binary data, so I was forced to dive into the basics and found more problems, as shown in the attached tests. My concern is that I do not understand why pdfparser parses strings like this: pdfparser/src/Smalot/PdfParser/Element/ElementString.php Lines 62 to 74 in 4b86c66
so I am not comfortable modifying it. I am looking for advice or help in fixing the problem, as I wrote earlier in #715.
I have rewritten the element string parser, and the tests now pass.
I am awaiting your feedback.
Thanks for your effort! I will try to give you feedback as soon as possible.
- Moved some test-related code to a separate function to improve code readability
- renamed and refined testSpecialCharsEncodedAsHex: you don't need an if-clause in a test if you insist on certain values along the way, just use assertX to check for expected values ("fail early")
- ElementString: moved the part which handles escaped characters to a separate function to improve code readability
- added references / comments
... which fail (mostly) without the fixes in this branch.
@krzyc Again, thanks for your effort. I dove into your code and the PDF specification 1.7 and holy cow, the spec has over 1300 pages. This topic is not easy, but I think I understand what you are trying to achieve with your patch as well as what the spec says in this regard. Further feedback is welcome as always, CC: @j0k3r @GreyWyvern.
I took the liberty to change some of your code. Here is a summary of my changes:
- refactored some code parts
- added references and some comments
- added a few tests
(for further details, please see my commit notes)
About the tests: I copied a few example texts from the spec (pages 54 and 55) to see if PDFParser fails without your fixes. It does most of the time. Only this one works both pre- and post-patch.
What is great is the fact that it now parses strings like (ref):
(Strings may contain balanced parentheses ( ) and
special characters (*!&}^% and so on).)
Pre-patch, PDFParser would only have parsed up to "Strings may contain balanced parentheses (". This change is highly relevant for scientific papers, which often use brackets of some kind (( ), [ ], { }, ...).
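To make the balanced-parentheses rule concrete, here is a minimal sketch of how a parser could locate the end of a literal string. `findLiteralStringEnd` is a hypothetical helper written for illustration only, not code from this PR or from pdfparser, and it assumes the input starts at the opening parenthesis:

```php
<?php

/**
 * Hypothetical helper (not part of pdfparser): returns the offset of the
 * parenthesis that closes the literal string opening at $start, or null
 * if the parentheses never balance.
 */
function findLiteralStringEnd(string $content, int $start = 0): ?int
{
    $depth = 0;
    $length = strlen($content);

    for ($i = $start; $i < $length; ++$i) {
        $char = $content[$i];

        if ('\\' === $char) {
            // A backslash escapes the next byte, so skip it without counting it.
            ++$i;
        } elseif ('(' === $char) {
            ++$depth;
        } elseif (')' === $char) {
            --$depth;
            if (0 === $depth) {
                // Nesting is back to zero: this parenthesis closes the string.
                return $i;
            }
        }
    }

    return null;
}

// Usage: extract the string body from a content fragment.
$raw = '(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).) Tj';
$end = findLiteralStringEnd($raw);
echo substr($raw, 1, $end - 1), PHP_EOL;
```

Counting nesting depth instead of stopping at the first closing parenthesis is what lets the whole spec example survive; escaped parentheses are skipped so they never affect the depth.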
I can imagine there are further edge cases which still might not work. But we should aim for the following: the parser can gain new features/capabilities as long as it retains its current ones.
To your knowledge, are there any cases left which worked before but don't now?
case ' ': // TODO: this should probably be removed - kept for compatibility
    $processedName .= $nextChar;
    $name = substr($name, 1);
    ++$position;
    break;
What do you mean by "kept for compatibility"? Compatibility with what?
  // repackage string as standard
- $name = '('.self::decode($name).')';
- $element = ElementDate::parse($name, $document);
+ $name = self::decode($name);
+ $element = ElementDate::parse('('.$name.')', $document);

  if (!$element) {
-     $element = ElementString::parse($name, $document);
+     $element = new self($name);
  }
Please give me a short summary of what you are trying to accomplish here.
Why remove the reference to $document?
Also, why switch from ElementString to ElementHexa? Is it to match the existing behavior of other element classes?
After an initial review, this appears to be an issue solely with ElementString::parse. There are zero changes required to the ElementHexa.php file IMHO. ElementHexa::parse sends ElementString::parse a perfectly valid string, and it's ElementString that proceeds to mess it up. The only reason ElementHexa is involved at all is because it's a vector that allows this sort of tricky (but still valid!) string to reach ElementString, whereas normally ElementString would rarely see such strings.

ElementString::parse expects all internal parentheses to be escaped, and I would wager that in >99% of PDFs, they are. But according to the PDF Reference, balanced, unescaped parentheses are allowed. So this is a matter of just updating ElementString::parse to handle balanced, unescaped parentheses. I haven't fully examined the updated code in ElementString::parse in this PR, but as long as it does that, there should be no other edits necessary to ElementHexa.php.

Edit: In fact, when I update the test hexadecimal string to
What are your plans here @krzyc? As mentioned, the change in ElementHexa does not seem to be required.
I've been thinking about this some more and it's a little more complicated than it seems at first glance. ElementString accepts (and currently expects) strings with escaped backslashes and parentheses. 99% of the time strings from PDF document streams will be escaped, BUT unescaped, balanced parentheses are allowed. Unescaped backslashes are NOT allowed; according to the PDF Reference, if they aren't followed by a valid escape character, they should be ignored.

And that is the problem with strings from ElementHexa. It passes a completely unescaped (but valid) string, but if that string contains literal backslashes, then parsing it further will remove them or produce an incorrect special character. Strings that originate from ElementHexa should be passed to ElementString to create the appropriate object, but
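The backslash problem described above can be illustrated with a minimal sketch. Both functions are hypothetical helpers written for this example, not pdfparser APIs: `unescapeLiteral` resolves the escape sequences the PDF Reference defines for literal strings (dropping a backslash that is not followed by a valid escape character), and `escapeLiteral` shows one possible way, assumed here rather than taken from the PR, to protect hex-decoded bytes before they reach such a parser:

```php
<?php

/** Hypothetical helper: resolves PDF literal-string escape sequences. */
function unescapeLiteral(string $text): string
{
    $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f",
            '(' => '(', ')' => ')', '\\' => '\\'];
    $out = '';
    $length = strlen($text);

    for ($i = 0; $i < $length; ++$i) {
        if ('\\' !== $text[$i]) {
            $out .= $text[$i];
            continue;
        }
        if ($i + 1 >= $length) {
            break; // trailing lone backslash: nothing to escape, drop it
        }
        $next = $text[++$i];
        if (isset($map[$next])) {
            $out .= $map[$next];
        } elseif (preg_match('/\G[0-7]{1,3}/', $text, $m, 0, $i)) {
            // Octal character code of one to three digits, e.g. \053 is "+".
            $out .= chr(octdec($m[0]) & 0xFF);
            $i += strlen($m[0]) - 1;
        } elseif ("\r" === $next || "\n" === $next) {
            // Backslash before an end-of-line marker is a line continuation.
            if ("\r" === $next && $i + 1 < $length && "\n" === $text[$i + 1]) {
                ++$i;
            }
        } else {
            // Not a valid escape: the backslash is ignored, the character kept.
            $out .= $next;
        }
    }

    return $out;
}

/** Hypothetical helper: escapes raw bytes so unescapeLiteral() restores them. */
function escapeLiteral(string $raw): string
{
    return strtr($raw, ['\\' => '\\\\', '(' => '\\(', ')' => '\\)']);
}

// <5C6E28> decodes to three raw bytes: backslash, "n", "(".
$raw = hex2bin('5C6E28');
$naive = unescapeLiteral($raw);                 // backslash + n is misread as a newline
$safe = unescapeLiteral(escapeLiteral($raw));   // round trip restores the original bytes
var_dump($naive === $raw, $safe === $raw);
```

The round trip is the point of the sketch: bytes that come out of a hexadecimal string are already final, so they either need escaping before going through escape processing or must bypass it entirely.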
I don't want to let this PR become stale, but my time is limited currently. What is the minimal amount of work to finish this? A revert of the ElementHexa part and some refinement should be enough. Any suggestions?
Closed due to inactivity and no response from the author.
About
Fix for #715 as requested by @k00ni. Tests are passing, but I am not confident that this will not break pdfparser because I lack knowledge of the library's internals and of the PDF format.