Optimize CDATA normalization using memchr and blockwise copying #133

Merged: 3 commits into master on Jan 14, 2025

Conversation

@adamreichold (Collaborator) commented Jan 13, 2025

memchr should be a well-known quantity as a dependency and provides ample optimization opportunities throughout the parser. For example, adjusting the normalization of line endings to use it yields the following improvement for processing 1 MB of random CDATA:

```
before: test cdata_roxmltree                        ... bench:   6,039,358 ns/iter (+/- 41,232)
after:  test cdata_roxmltree                        ... bench:   3,143,947 ns/iter (+/- 168,323)
```
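For the record, a minimal sketch of the blockwise approach (assuming a plain String as the output sink; the actual commit writes into the parser's internal text buffer):

```rust
use memchr::memchr;

// A sketch only: `out` is a plain String here, while the real code
// writes into roxmltree's internal text buffer.
fn normalize_cdata(text: &str, out: &mut String) {
    let bytes = text.as_bytes();
    let mut pos = 0;
    // Jump straight to the next '\r'; everything before it is copied
    // over as one block instead of character by character.
    while let Some(off) = memchr(b'\r', &bytes[pos..]) {
        let at = pos + off;
        out.push_str(&text[pos..at]);
        out.push('\n');
        // Collapse "\r\n" into a single "\n" by skipping the '\n'.
        pos = if bytes.get(at + 1) == Some(&b'\n') { at + 2 } else { at + 1 };
    }
    out.push_str(&text[pos..]);
}
```

Since '\r' is ASCII, the byte indices returned by memchr always sit on character boundaries, so the string slicing above cannot panic.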

This would also supersede #128.

@adamreichold adamreichold changed the title Add dependency on memchr and use it to optimize CDATA normalization RFC: Add dependency on memchr and use it to optimize CDATA normalization Jan 13, 2025
@RazrFalcon (Owner)

Dependencies - bad. Are you sure there is no other way for us to improve it?

@adamreichold (Collaborator, Author)

Dependencies - bad. Are you sure there is no other way for us to improve it?

There are other ways, see #128, or one could implement the same chunking strategy using std's str::find (which, however, did not reliably apply SIMD the last time I tried it). This particular optimization is only meant as an example of what access to memchr can provide in general.

So while I agree that dependencies should never be taken on lightly, the performance implications of having access to memchr could reverberate throughout the code base, and the dependency itself seems manageable, bringing in no transitive dependencies of its own.

@adamreichold (Collaborator, Author)

one could implement the same chunking strategy using std's str::find (which, however, did not reliably apply SIMD the last time I tried it)

Gave this a try, and it is still a significant improvement, even if not on par with memchr; cf.

```
before:    test cdata_roxmltree                        ... bench:   6,039,358 ns/iter (+/- 41,232)
memchr:    test cdata_roxmltree                        ... bench:   3,143,947 ns/iter (+/- 168,323)
str::find: test cdata_roxmltree                        ... bench:   4,323,370 ns/iter (+/- 43,884)
```
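For comparison, the std-only variant looks roughly like this sketch (same hypothetical String sink as above):

```rust
// Same chunking strategy with std only. str::find('\r') returns a
// byte index, and '\r' is ASCII, so the slicing below always stays
// on character boundaries.
fn normalize_cdata_std(text: &str, out: &mut String) {
    let mut rest = text;
    while let Some(at) = rest.find('\r') {
        out.push_str(&rest[..at]);
        out.push('\n');
        rest = &rest[at + 1..];
        // Collapse "\r\n" into a single "\n".
        if rest.as_bytes().first() == Some(&b'\n') {
            rest = &rest[1..];
        }
    }
    out.push_str(rest);
}
```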

@RazrFalcon (Owner)

I don't mind memchr specifically, but the existing code is pretty slow on its own. We use chars + enumerate + CharToBytes - that's pretty slow. We might consider optimizing it first.
Also, CDATA is an edge case, as far as I know. So unless we use memchr everywhere - it's not much of a benefit.
Also, isn't memchr used by Rust's std already?

Also, I'm not sure your code is equivalent. The existing one operates on Unicode codepoints, not bytes. We should make sure we're not breaking non-Latin XMLs.
PS: I don't remember why we are going through all those hoops with TextBuffer and CharToBytes to begin with. But I guess it was important at some point.

@adamreichold (Collaborator, Author) commented Jan 13, 2025

Also, CDATA is an edge case, as far as I know. So unless we use memchr everywhere - it's not much of a benefit.

As written, CDATA was just low-hanging fruit from my PoV. If we do add memchr, then there are many places in the parser and tokenizer where it could be applied.

Also, isn't memchr used by Rust's std already?

memchr is a dependency of std, but its usage is indirect via the Pattern abstraction, which appears to imply some overhead as per #133 (comment).

Also, I'm not sure your code is equivalent. The existing one operates on Unicode codepoints, not bytes. We should make sure we're not breaking non-Latin XMLs.

For CDATA, only line endings are normalized, so the relevant part of the specification is

translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

And since both \r and \n are single-byte characters and everything else is passed on as-is, there is no effect on multi-byte characters. Also note that the second version I pushed here, which uses str::find instead of memchr, basically operates on characters and string slices. (The only place where it accesses a single byte is the lookahead, and I could also use starts_with("\n") there if I really wanted to avoid individual bytes.)
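For illustration, a few hedged assertions of what that rule implies (normalize is a hypothetical stand-in for the actual entry point):

```rust
// `normalize` is a hypothetical stand-in for the actual normalization
// entry point; these assertions just illustrate the spec rule above.
assert_eq!(normalize("a\r\nb"), "a\nb");    // #xD #xA -> single #xA
assert_eq!(normalize("a\rb"), "a\nb");      // lone #xD -> #xA
assert_eq!(normalize("a\nb"), "a\nb");      // #xA alone is untouched
assert_eq!(normalize("日\r\n本"), "日\n本"); // multi-byte neighbours survive
```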

PS: I don't remember why we are going through all those hoops with TextBuffer and CharToBytes to begin with. But I guess it was important at some point.

To be honest, I think this code was copied over from text processing and then slimmed down. For example, the is_empty check at the end is impossible to hit, as the resulting buffer can never be empty. (The only way for it to become empty is for the input to be empty, but then we would pass that on as borrowed text via the fast path.)


So in summary, since you are unsure the code is equivalent and it is a significant improvement even using just str::find, I would suggest we take that route without the additional dependency and concentrate on correctness for now. I'll add another CDATA test case involving multi-byte characters close to line endings to ensure the character-by-character processing really was unnecessary.

@adamreichold (Collaborator, Author)

So I added

```xml
<svg>
    <style><![CDATA[some
🐉
👃
]]></style>
</svg>
```

as a test case that requires line endings to be normalized, and it behaves exactly the same on master as it does here.

@adamreichold adamreichold changed the title RFC: Add dependency on memchr and use it to optimize CDATA normalization Optimize CDATA normalization using str::find and blockwise copying Jan 13, 2025
@adamreichold (Collaborator, Author)

Also, I'm not sure your code is equivalent. The existing one operates on Unicode codepoints, not bytes. We should make sure we're not breaking non-Latin XMLs.

I fear stating something well-known here, but the underlying invariant that lets \r and \n be handled individually without looking at other multi-byte characters is that UTF-8 is self-synchronizing, i.e., looking at a single byte you can immediately tell whether it is the start of a code point or not. This implies that \r or \n will never be one of the bytes making up a multi-byte character, and looking directly for these bytes will not break up other characters.
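A quick self-contained check of that invariant (a sketch, not code from this PR):

```rust
// UTF-8 self-synchronization in practice: every byte of a multi-byte
// sequence has its high bit set (lead bytes 0xC2..=0xF4, continuation
// bytes 0x80..=0xBF), so '\r' (0x0D) and '\n' (0x0A) only ever encode
// themselves and always sit on character boundaries.
fn main() {
    let s = "🐉\r\n👃";
    for (i, &b) in s.as_bytes().iter().enumerate() {
        if b == b'\r' || b == b'\n' {
            assert!(s.is_char_boundary(i));
            assert!(s.is_char_boundary(i + 1));
        }
    }
}
```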

@RazrFalcon (Owner)

Got it. Agree. As long as we have a couple more tests and lxml produces the same output - we should be fine.
It has been a long time since I last worked on this project.

For a nicer history I would suggest having two commits:

  1. Remove TextBuffer from process_cdata, aka simplify it + tests
  2. Use memchr in process_cdata, aka optimization

@adamreichold (Collaborator, Author)

Will make it a two-step process: first without TextBuffer, using bytes only, and then using str::find (leaving out the memchr dependency to keep things simple).

I did have a look at the existing CDATA tests, and besides the one missing case of multi-byte characters before or after line breaks, which I already added, I don't see any glaring omissions. Did you have some specific test case in mind?

@adamreichold (Collaborator, Author)

So I added a simplified bytewise implementation with two additional test cases and then replaced it with the blockwise-copying one, without any changes in the output.

@adamreichold (Collaborator, Author) commented Jan 13, 2025

Hhhmmm, so I had a look at the source of the standard library, cf. https://doc.rust-lang.org/stable/src/core/str/pattern.rs.html#439, and while memchr is used for single-character patterns, it is still not specialized for single-byte patterns, and the multi-character variants do not use memchr at all. So, for example, we could not use str::find to meaningfully accelerate the normalization checks for text and attribute values. So, I think I'll go ahead and add another commit for memchr after all.
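As an illustration of what the crate offers beyond what str::find reaches, a multi-needle scan (hypothetical usage, not code from this PR; requires the memchr crate):

```rust
// memchr2 scans for two needles in one pass, e.g. finding the next
// line ending or entity reference while normalizing text content.
fn main() {
    let text = "some text &amp; more\r\n";
    if let Some(at) = memchr::memchr2(b'\r', b'&', text.as_bytes()) {
        println!("first byte needing special handling is at {}", at);
    }
}
```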

If I had to go to a lonely island where survival depended on writing fast parsers and I could only take one crate, it would certainly be memchr...

@adamreichold adamreichold changed the title Optimize CDATA normalization using str::find and blockwise copying Optimize CDATA normalization using memchr and blockwise copying Jan 13, 2025
@RazrFalcon (Owner)

Did you have some specific test case in mind?

A single multi-byte test would be enough. As long as you think our tests cover all cases - it's fine.

If I had to go to a lonely island where survival would depend on writing fast parser and I could only take one crate, it would certainly be memchr...

😄
I do agree. It's a nice bite-size crate.

@adamreichold (Collaborator, Author)

A single multi-byte test would be enough. As long as you think our tests cover all cases - it's fine.

Ok, I did add a new multi-byte test and another test with different combinations of \r\n and \n bunched up, with the idea of catching off-by-one errors when copying the chunks. So I guess we should be fine for the time being.
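For reference, a hedged sketch of the kind of bunched-up input those tests exercise (normalize again being a hypothetical stand-in for the actual entry point):

```rust
// Inputs with adjacent "\r\n" and lone "\r"/"\n" sequences exercise
// the chunk boundaries of the blockwise copying.
assert_eq!(normalize("\r\n\n\r\r\n"), "\n\n\n\n");
assert_eq!(normalize("\r"), "\n");     // lone '\r' at the very end
assert_eq!(normalize("a\r\n"), "a\n"); // "\r\n" terminating the input
```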

@adamreichold adamreichold merged commit 9d6a32f into master Jan 14, 2025
4 checks passed
@adamreichold adamreichold deleted the memchr-cdata branch January 14, 2025 19:27