Possibility to improve parsing performance by not using uniqid() and add an early return? #712
Comments
Sounds reasonable and interesting, so please create a pull request and let's discuss there.
Without a deeper look into the code, it doesn't seem necessary to use an expensive function such as uniqid().
Because I don't know where we go with the first point, please use a separate PR for the second one. At least for now.
The reason for the unique ID is that the simpler the replacement string, the higher the chance it may appear elsewhere in the document and cause the wrong text to be restored later. For example, That being said, if the
Edit: Wait, you're saying the
Exactly,
Understood the increased likelihood of unintended text replaces. What do you think about using
paired with
Seems to result in comparable performance gains.
$id = uniqid('IMAGE_', true);
$id = uniqid('STRING_', true);
$dictid = uniqid('DICT_', true);

The following code should produce a random-enough string with less CPU usage (not benchmarked, though):

$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7);

The code could be even simpler if an incremented integer value is used instead of rand(). So we would end up with something like:

private function formatContent(?string $content): string
{
$elementId = 0;
// ...
$hash = hash('md2', ++$elementId);
// maybe length (5) can be even shorter?
$id = 'IMAGE_'.substr($hash, 0, 5);
// ...
}

Btw.
Besides replacing

$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7).'_IMAGE_END';

I assume the reduction is because of
Why use a big function like a message digest hash when there's ElfHash, which is faster than CRC32 from what I have heard?
I'm new to PHP and I found this:
The performance loss is not caused by calling pdfparser/src/Smalot/PdfParser/PDFObject.php Lines 384 to 398 in a19d555
The call to My first idea was to simplify the placeholder, which led to a huge performance gain in my case, with the downside of an increased likelihood of unintended replacements ( Second idea now is to slightly reduce complexity in the placeholder with I will do a few tests, compare performance and make a suggestion.
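A rough way to set up such a comparison, as a sketch only: the helper `timePlaceholders()` and the three candidate generators are made up for illustration and are not pdfparser code. It times generating placeholders plus restoring them one at a time with `str_replace()`, which is the step suspected of being slow. The `@@S_n@@` format is an assumption, not something proposed in this thread.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical micro-benchmark helper (not part of pdfparser).
 * Builds $count placeholders with $makeId, then restores them one
 * by one with str_replace(). Returns elapsed nanoseconds.
 */
function timePlaceholders(callable $makeId, int $count): int
{
    $start = hrtime(true);
    $map = [];
    $text = '';
    for ($i = 0; $i < $count; $i++) {
        $id = $makeId($i);
        $map[$id] = "(text $i)";
        $text .= $id . ' ';
    }
    // One str_replace() call per placeholder, scanning the whole text each time.
    foreach ($map as $id => $original) {
        $text = str_replace($id, $original, $text);
    }
    return hrtime(true) - $start;
}

// Candidate generators; the formats are assumptions for illustration.
$candidates = [
    'uniqid'  => fn (int $i): string => uniqid('STRING_', true),
    'md2'     => fn (int $i): string => 'STRING_' . substr(hash('md2', (string) $i), 0, 7),
    'counter' => fn (int $i): string => "@@S_{$i}@@",
];

$results = [];
foreach ($candidates as $name => $makeId) {
    $results[$name] = timePlaceholders($makeId, 2000);
}

foreach ($results as $name => $ns) {
    printf("%-8s %8.2f ms\n", $name, $ns / 1e6);
}
```

Timings vary by machine and PHP version, so this only gives a relative picture. A real PDF content stream would be a better test input than the synthetic text used here.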
You might want to benchmark using Maybe even
I parse PDFs where strings are encoded in the bracket notation like
[(Xxxx)1.7(a)1.2(tt )-10.1(\374)1.1(be)1.4(r )-8.5(Xxxx)-7.8(xx)-9.5(xxx)2.2(xxx)-1.1(x)-9(xxx)2( )]TJ
I have an idea to improve parsing performance which reduced parsing time from 1.65 seconds to 0.54 seconds in my example:
First, I observed a significant performance increase when calling Page::getTextArray() on documents with strings like this, using a simpler placeholder in PDFObject::formatContent() such as $id = "S_$i"; where $i is a simple counting integer, instead of $id = uniqid('STRING_', true);
The reason is the numerous later calls to str_replace() that swap the placeholders back for the original content. The complex placeholder generated by uniqid() seems to slow these calls down.
Is there any reason for using this complex placeholder, or can it be replaced by a much simpler one?
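To make the idea concrete, here is a minimal sketch of masking strings with short counter-based placeholders and restoring them afterwards. It is not pdfparser's actual formatContent(); the function names `maskStrings()` and `restoreStrings()`, the regular expression, and the `@@S_n@@` placeholder format are all assumptions for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: replaces every parenthesised PDF string with a
 * short counter-based placeholder and records the original text.
 *
 * @return array{0: string, 1: array<string, string>} masked content and placeholder map
 */
function maskStrings(string $content): array
{
    $map = [];
    $i = 0;
    $masked = preg_replace_callback(
        // "(" followed by escaped characters or non-paren characters, then ")"
        '/\((?:\\\\.|[^\\\\()])*\)/',
        function (array $m) use (&$map, &$i): string {
            $id = '@@S_' . $i++ . '@@';
            $map[$id] = $m[0];
            return $id;
        },
        $content
    );

    return [$masked ?? $content, $map];
}

/** Restores all placeholders in a single pass over the text. */
function restoreStrings(string $masked, array $map): string
{
    return strtr($masked, $map);
}

$input = '[(Xxxx)1.7(a)1.2(tt )-10.1(\374)1.1(be)]TJ';
[$masked, $map] = maskStrings($input);
echo $masked, PHP_EOL;
echo restoreStrings($masked, $map), PHP_EOL;
```

Two details matter with counter-based placeholders. A bare "S_1" is a prefix of "S_10", so restoring one placeholder at a time with str_replace() would corrupt the longer one; the closing "@@" avoids that. Passing the whole map to strtr() also restores everything in one pass instead of one scan per placeholder. The collision risk with text that already contains such a token remains and would need a check.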
Second, I noticed two (more or less) consecutive calls to PDFObject->getTextArray($this) in Page::getTextArray(), where the first one might return immediately:
pdfparser/src/Smalot/PdfParser/Page.php, lines 346 to 350 in a19d555
Is there a reason for not returning in the try case and instead returning the same call later?
pdfparser/src/Smalot/PdfParser/Page.php, line 365 in a19d555
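The pattern I have in mind looks roughly like the sketch below. This is not the code from Page.php; `extractText()` and `getTextArrayEarlyReturn()` are hypothetical stand-ins, and the call counter exists only to show how often the expensive call runs.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for the expensive extraction call. */
function extractText(bool $fail, int &$calls): array
{
    $calls++;
    if ($fail) {
        throw new RuntimeException('extraction failed');
    }

    return ['Xxxx', 'a', 'tt'];
}

/** Returns straight from the try block, so the success path costs one call. */
function getTextArrayEarlyReturn(bool $failFirst, int &$calls): array
{
    try {
        return extractText($failFirst, $calls);
    } catch (RuntimeException $e) {
        // Only reached when the first attempt threw; whatever preparation
        // the real code needs before retrying would go here.
    }

    return extractText(false, $calls);
}

$calls = 0;
$result = getTextArrayEarlyReturn(false, $calls);
printf("success path: %d call(s), %d item(s)\n", $calls, count($result));
```

With the early return, the second call happens only on the failure path.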
Thank you for your time!