Optimize CDATA normalization using memchr and blockwise copying #133

Merged: 3 commits into master on Jan 14, 2025

Conversation

@adamreichold (Collaborator) commented Jan 13, 2025

memchr should be a well-known quantity as a dependency and provides ample optimization opportunities throughout the parser. For example, adjusting the normalization of line endings to use it yields the following improvement for processing 1 MB of random CDATA:

```
before: test cdata_roxmltree                        ... bench:   6,039,358 ns/iter (+/- 41,232)
after:  test cdata_roxmltree                        ... bench:   3,143,947 ns/iter (+/- 168,323)
```
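For the record, a minimal sketch of the blockwise approach (assuming a plain String as the output sink; the actual commit writes into the parser's internal text buffer):

```rust
use memchr::memchr;

// A sketch only: `out` is a plain String here, while the real code
// writes into roxmltree's internal text buffer.
fn normalize_cdata(text: &str, out: &mut String) {
    let bytes = text.as_bytes();
    let mut pos = 0;
    // Jump straight to the next '\r'; everything before it is copied
    // over as one block instead of character by character.
    while let Some(off) = memchr(b'\r', &bytes[pos..]) {
        let at = pos + off;
        out.push_str(&text[pos..at]);
        out.push('\n');
        // Collapse "\r\n" into a single "\n" by skipping the '\n'.
        pos = if bytes.get(at + 1) == Some(&b'\n') { at + 2 } else { at + 1 };
    }
    out.push_str(&text[pos..]);
}
```

Since '\r' is ASCII, the byte indices returned by memchr always sit on character boundaries, so the string slicing above cannot panic.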

This would also supersede #128.

@adamreichold adamreichold changed the title Add dependency on memchr and use it to optimize CDATA normalization RFC: Add dependency on memchr and use it to optimize CDATA normalization Jan 13, 2025
@RazrFalcon (Owner)

Dependencies - bad. Are you sure there is no other way for us to improve it?

@adamreichold (Collaborator, Author)

Dependencies - bad. Are you sure there is no other way for us to improve it?

There are other ways, see #128, or one could implement the same chunking strategy using std's str::find (which, however, did not reliably apply SIMD the last time I tried it). This particular optimization is only meant as an example of what access to memchr can provide in general.

So while I agree that dependencies should never be taken on lightly, the performance implications of having access to memchr could reverberate throughout the code base, and the dependency itself seems manageable, bringing in no transitive dependencies of its own.

@adamreichold (Collaborator, Author)

one could implement the same chunking strategy using std's str::find (which, however, did not reliably apply SIMD the last time I tried it)

Gave this a try, and it is still a significant improvement, even if not on par with memchr; cf.

```
before:    test cdata_roxmltree                        ... bench:   6,039,358 ns/iter (+/- 41,232)
memchr:    test cdata_roxmltree                        ... bench:   3,143,947 ns/iter (+/- 168,323)
str::find: test cdata_roxmltree                        ... bench:   4,323,370 ns/iter (+/- 43,884)
```
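For comparison, the std-only variant looks roughly like this sketch (same hypothetical String sink as above):

```rust
// Same chunking strategy with std only. str::find('\r') returns a
// byte index, and '\r' is ASCII, so the slicing below always stays
// on character boundaries.
fn normalize_cdata_std(text: &str, out: &mut String) {
    let mut rest = text;
    while let Some(at) = rest.find('\r') {
        out.push_str(&rest[..at]);
        out.push('\n');
        rest = &rest[at + 1..];
        // Collapse "\r\n" into a single "\n".
        if rest.as_bytes().first() == Some(&b'\n') {
            rest = &rest[1..];
        }
    }
    out.push_str(rest);
}
```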

@RazrFalcon (Owner)

I don't mind memchr specifically, but the existing code is pretty slow on its own. We use chars + enumerate + CharToBytes - that's pretty slow. We might consider optimizing it first.
Also, CDATA is an edge case, as far as I know. So unless we use memchr everywhere - it's not much of a benefit.
Also, isn't memchr used by Rust's std already?

Also, I'm not sure your code is equivalent. The existing one operates on Unicode codepoints, not bytes. We should make sure we're not breaking non-Latin XMLs.
PS: I don't remember why we are going through all those hoops with TextBuffer and CharToBytes to begin with. But I guess it was important at some point.

@adamreichold (Collaborator, Author) commented Jan 13, 2025

Also, CDATA is an edge case, as far as I know. So unless we use memchr everywhere - it's not much of a benefit.

As written, CDATA was just low-hanging fruit from my PoV. If we do add memchr, then there are many places in the parser and tokenizer where it could be applied.

Also, isn't memchr used by Rust's std already?

memchr is a dependency of std, but its usage is indirect via the Pattern abstraction, which appears to imply some overhead as per #133 (comment).

Also, I'm not sure your code is equivalent. The existing one operates on Unicode codepoints, not bytes. We should make sure we're not breaking non-Latin XMLs.

For CDATA, only line endings are normalized, so the relevant part of the specification is

translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

And since both \r and \n are single-byte characters and everything else is passed on as-is, there is no effect on multi-byte characters. Also note that the second version I pushed here, which uses str::find instead of memchr, basically operates on characters and string slices. (The only place where it accesses a single byte is the lookahead, and I could also use starts_with("\n") there if I really wanted to avoid individual bytes.)
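For illustration, a few hedged assertions of what that rule implies (normalize is a hypothetical stand-in for the actual entry point):

```rust
// `normalize` is a hypothetical stand-in for the actual normalization
// entry point; these assertions just illustrate the spec rule above.
assert_eq!(normalize("a\r\nb"), "a\nb");    // #xD #xA -> single #xA
assert_eq!(normalize("a\rb"), "a\nb");      // lone #xD -> #xA
assert_eq!(normalize("a\nb"), "a\nb");      // #xA alone is untouched
assert_eq!(normalize("日\r\n本"), "日\n本"); // multi-byte neighbours survive
```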

PS: I don't remember why we are going through all those hoops with TextBuffer and CharToBytes to begin with. But I guess it was important at some point.

To be honest, I think this code was copied over from text processing and then slimmed down. For example, the is_empty check at the end is impossible to hit, as the resulting buffer can never be empty. (The only way for it to become empty is for the input to be empty, but then we would pass that on as borrowed text via the fast path.)


So in summary, since you are unsure the code is equivalent and it is a significant improvement even using just str::find, I would suggest we take that route without the additional dependency and concentrate on correctness for now. I'll add another CDATA test case involving multi-byte characters close to line endings to ensure the character-by-character processing really was unnecessary.

@adamreichold (Collaborator, Author)

So I added

```xml
<svg>
    <style><![CDATA[some
🐉
👃
]]></style>
</svg>
```

as a test case that requires line endings to be normalized, and it behaves exactly the same on master as it does here.

@adamreichold adamreichold changed the title RFC: Add dependency on memchr and use it to optimize CDATA normalization Optimize CDATA normalization using str::find and blockwise copying Jan 13, 2025
@adamreichold (Collaborator, Author)

Also, I'm not sure your code is equivalent. The existing one operates on Unicode codepoints, not bytes. We should make sure we're not breaking non-Latin XMLs.

I fear stating something well-known here, but the underlying invariant that lets \r and \n be handled individually without looking at other multi-byte characters is that UTF-8 is self-synchronizing, i.e., looking at a single byte you can immediately tell whether it is the start of a code point or not. This implies that \r or \n will never be one of the bytes making up a multi-byte character, and looking directly for these bytes will not break up other characters.
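A quick self-contained check of that invariant (a sketch, not code from this PR):

```rust
// UTF-8 self-synchronization in practice: every byte of a multi-byte
// sequence has its high bit set (lead bytes 0xC2..=0xF4, continuation
// bytes 0x80..=0xBF), so '\r' (0x0D) and '\n' (0x0A) only ever encode
// themselves and always sit on character boundaries.
fn main() {
    let s = "🐉\r\n👃";
    for (i, &b) in s.as_bytes().iter().enumerate() {
        if b == b'\r' || b == b'\n' {
            assert!(s.is_char_boundary(i));
            assert!(s.is_char_boundary(i + 1));
        }
    }
}
```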

@RazrFalcon (Owner)

Got it. Agree. As long as we have a couple more tests and lxml produces the same output - we should be fine.
It has been a long time since I last worked on this project.

For a nicer history I would suggest having two commits:

  1. Remove TextBuffer from process_cdata, aka simplify it + tests
  2. Use memchr in process_cdata, aka optimization

@adamreichold (Collaborator, Author)

Will make it a two-step process: first without TextBuffer, using bytes only, and then using str::find (leaving out the memchr dependency to keep things simple).

I did have a look at the existing CDATA tests, and besides the one missing case of multi-byte characters before or after line breaks, which I already added, I don't see any glaring omissions. Did you have some specific test case in mind?

@adamreichold (Collaborator, Author)

So I added a simplified bytewise implementation with two additional test cases and then replaced it with the blockwise-copying one, without any changes in the output.

@adamreichold (Collaborator, Author) commented Jan 13, 2025

Hhhmmm, so I had a look at the source of the standard library, cf. https://doc.rust-lang.org/stable/src/core/str/pattern.rs.html#439, and while memchr is used for single-character patterns, it is still not specialized for single-byte patterns, and the multi-character variants do not use memchr at all. So, for example, we could not use str::find to meaningfully accelerate the normalization checks for text and attribute values. So, I think I'll go ahead and add another commit for memchr after all.
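As an illustration of what the crate offers beyond what str::find reaches, a multi-needle scan (hypothetical usage, not code from this PR; requires the memchr crate):

```rust
// memchr2 scans for two needles in one pass, e.g. finding the next
// line ending or entity reference while normalizing text content.
fn main() {
    let text = "some text &amp; more\r\n";
    if let Some(at) = memchr::memchr2(b'\r', b'&', text.as_bytes()) {
        println!("first byte needing special handling is at {}", at);
    }
}
```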

If I had to go to a lonely island where survival depended on writing fast parsers and I could only take one crate, it would certainly be memchr...

@adamreichold adamreichold changed the title Optimize CDATA normalization using str::find and blockwise copying Optimize CDATA normalization using memchr and blockwise copying Jan 13, 2025
@RazrFalcon (Owner)

Did you have some specific test case in mind?

A single multi-byte test would be enough. As long as you think our tests cover all cases - it's fine.

If I had to go to a lonely island where survival would depend on writing fast parser and I could only take one crate, it would certainly be memchr...

😄
I do agree. It's a nice bite-size crate.

@adamreichold (Collaborator, Author)

A single multi-byte test would be enough. As long as you think our tests cover all cases - it's fine.

Ok, I did add a new multi-byte test and another test with different combinations of \r\n and \n bunched up, with the idea of catching off-by-one errors when copying the chunks. So I guess we should be fine for the time being.
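For reference, a hedged sketch of the kind of bunched-up input those tests exercise (normalize again being a hypothetical stand-in for the actual entry point):

```rust
// Inputs with adjacent "\r\n" and lone "\r"/"\n" sequences exercise
// the chunk boundaries of the blockwise copying.
assert_eq!(normalize("\r\n\n\r\r\n"), "\n\n\n\n");
assert_eq!(normalize("\r"), "\n");     // lone '\r' at the very end
assert_eq!(normalize("a\r\n"), "a\n"); // "\r\n" terminating the input
```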

@adamreichold adamreichold merged commit 9d6a32f into master Jan 14, 2025
4 checks passed
@adamreichold adamreichold deleted the memchr-cdata branch January 14, 2025 19:27