Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowed memory size exhausted on Font.php line 150 #735

Open
UnnitMetaliya opened this issue Sep 16, 2024 · 14 comments
Open

Allowed memory size exhausted on Font.php line 150 #735

UnnitMetaliya opened this issue Sep 16, 2024 · 14 comments

Comments

@UnnitMetaliya
Copy link

  • PHP Version: PHP 8.3.10 (cli)
  • PDFParser Version: smalot/pdfparser": "^2.11"

Description:

PDF input

unparseble.pdf

Expected output & actual output

Getting error:

PHP Fatal error: Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes) in ../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 150

Symfony\Component\ErrorHandler\Error\FatalError

`Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes)

at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:150 146▕ 147▕ if (!isset(self::$uchrCache[$code])) { 148▕ // html_entity_decode() will not work with UTF-16 or UTF-32 char entities, 149▕ // therefore, we use mb_convert_encoding() instead ➜ 150▕ self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES'); 151▕ } 152▕ 153▕ return self::$uchrCache[$code]; 154▕ }

`Whoops\Exception\ErrorException

Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes)

at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:150
146▕
147▕ if (!isset(self::$uchrCache[$code])) {
148▕ // html_entity_decode() will not work with UTF-16 or UTF-32 char entities,
149▕ // therefore, we use mb_convert_encoding() instead
➜ 150▕ self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES');
151▕ }
152▕
153▕ return self::$uchrCache[$code];
154▕ }

  +1 vendor frames

2 [internal]:0
Whoops\Run::handleShutdown()

Code

$this->pdfParser = new Parser();
$pdf = $this->pdfParser->parseFile($file);

@GreyWyvern
Copy link
Contributor

GreyWyvern commented Sep 16, 2024

Ultimately this is happening because the sample document has Font bfrange values such as:

<037F> <0389> [<FEDFFEE0FBAB> <FE80> <0622> <FE82> <0623> <FE84> <0624> <FE86> <0625> <FE88> <0626>]

... whereas PdfParser only extracts rows that match a regexp targeting the following format:

<037C> <037D> <FD3E>

I think the solution is to treat rows that have square brackets as a direct string replacement rather than a numerical offset? But I am entirely unsure right now and will need to read up on the correct behaviour here.

Edit: If I alter the regexp in Font.php on line 220:

$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>?[ \r\n]+/is';

... like so:

$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>? *[\r\n]+/is';

... this makes PdfParser ignore all bfrange lines that have square brackets and the document text will be output successfully. However, the content of the missing rows not being parsed probably means the document output text is missing some content.

@UnnitMetaliya
Copy link
Author

@k00ni any idea how such a small thing like [ can cause memory leak issue?

@k00ni
Copy link
Collaborator

k00ni commented Sep 17, 2024

any idea how such a small thing like [ can cause memory leak issue?

No.

@GreyWyvern
Copy link
Contributor

@k00ni any idea how such a small thing like [ can cause memory leak issue?

It's giving the wrong numerical start and end values to a for loop and it doesn't end. Simple as that.

It's not the [ specifically, it's that the regexp is matching values inside the square brackets and using character values incorrectly as offsets.

@GreyWyvern
Copy link
Contributor

Note that the "fix" I provided in my previous post does NOT completely solve this issue; it should not be used as a workaround as it will cause errors in other files in the test suite.

I noticed, after my posts, that the current bfrange section in Font.php does account for the <0000> <0000> [<0000>] line format, but does so after it searches for regular <0000> <0000> <0000> style. The solution is to move the square-bracket format loop first, and then delete those lines from the section before allowing the regular-format loop to go at it. In this way, all items in the font table are properly filled and there is no infinite loop.

The proper fix is below, replacing this entire bfrange section:

// Support for multiple bfrange sections
if (preg_match_all('/beginbfrange(?P<sections>.*?)endbfrange/s', $content, $matches)) {
    foreach ($matches['sections'] as $section) {
        // Support for : <srcCode1> <srcCodeN> [<dstString1> <dstString2> ... <dstStringN>]
        // Some PDF file has 2-byte Unicode values on new lines > added \r\n
        $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *\[(?P<strings>[\r\n<>0-9A-F ]+)\][ \r\n]+/is';

        preg_match_all($regexp, $section, $matches);

        foreach ($matches['from'] as $key => $from) {
            $char_from = hexdec($from);
            $strings = [];

            preg_match_all('/<(?P<string>[0-9A-F]+)> */is', $matches['strings'][$key], $strings);

            foreach ($strings['string'] as $position => $string) {
                $parts = preg_split(
                    '/([0-9A-F]{4})/i',
                    $string,
                    0,
                    \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
                );
                $text = '';
                foreach ($parts as $part) {
                    $text .= self::uchr(hexdec($part));
                }
                $this->table[$char_from + $position] = $text;
            }

            // Remove these found matches from the bfrange section
            // This prevents the regexp below from finding false matches
            $section = str_replace($matches[0][$key], '', $section);
        }

        // Support for : <srcCode1> <srcCode2> <dstString>
        $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>[ \r\n]+/is';

        preg_match_all($regexp, $section, $matches);

        foreach ($matches['from'] as $key => $from) {
            $char_from = hexdec($from);
            $char_to = hexdec($matches['to'][$key]);
            $offset = hexdec($matches['offset'][$key]);

            for ($char = $char_from; $char <= $char_to; ++$char) {
                $this->table[$char] = self::uchr($char - $char_from + $offset);
            }
        }
    }
}

The example PDF is a bit too large (466kB) to include in the test-suite I think. @UnnitMetaliya if you remember how you created this file (if you in fact created it) and are able to make a smaller version we can include in the test suite, would you be willing to do that for us?

Otherwise, may we include the file you posted directly in the test suite? It's too complex of an issue to replicate with a mock document. Correctly declared font-tables are required.

@UnnitMetaliya
Copy link
Author

@GreyWyvern I did not create the test file but that pdf is publicly available. So, you can use it for sure.

@UnnitMetaliya
Copy link
Author

@k00ni do you think the fix @GreyWyvern can go in as permanent fix, if it can be considered as one?

@k00ni
Copy link
Collaborator

k00ni commented Sep 18, 2024

@k00ni do you think the fix @GreyWyvern can go in as permanent fix, if it can be considered as one?

@GreyWyvern thank you for taking the time here.

His changes look reasonable at first glance. I suggest one creates a PR and we see how it goes. My time is currently very limited but I try to help when I can.

@GreyWyvern
Copy link
Contributor

I suggest one creates a PR and we see how it goes. My time is currently very limited but I try to help when I can.

I'm on vacation starting tomorrow and will be back on Monday next week. I can create the PR then.

@UnnitMetaliya
Copy link
Author

@GreyWyvern I can create now if you want?

@GreyWyvern
Copy link
Contributor

@GreyWyvern I can create now if you want?

You can do that, sure. Thanks! But I won't be around to review it until next week. :)

@GreyWyvern
Copy link
Contributor

Are you working on a PR for this, @UnnitMetaliya? I don't want to step on any toes.

@UnnitMetaliya
Copy link
Author

@GreyWyvern yes. I should be able to put it out there soon. Just busy with work.

satrun77 added a commit to silverstripeltd/pdfparser that referenced this issue Nov 12, 2024
Fix copied from smalot#735
This is temporary fork until the depenency is fixed
@ic-officient
Copy link

@GreyWyvern I have forked the repo to ic-officient/pdfparser and made some minor changes in Font.php to the bfrange handling. I haven't made special tests for the changes, but it seems to work just fine for me and it does solve the OOM problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants