Optimize CDATA normalization using memchr and blockwise copying #133
Dependencies - bad. Are you sure there is no other way for us to improve it? |
There are other ways, see #128 or one could implement the same chunking strategy using std's So while I agree that dependencies should never be taken on lightly, the performance implications of having access to |
Gave this a try and it is still a significant improvement if not on par with
|
I don't mind Also, I'm not sure your code is equivalent. The existing one operates on Unicode codepoints, not bytes. We should make sure we're not breaking non-Latin XMLs. |
As written, CDATA was just low-hanging fruit from my PoV. If we do add
For CDATA only line endings are normalized so the relevant part of the specification is
And since both
To be honest, I think this code was copied over from text processing and then slimmed down. For example, the So in summary, since you are unsure the code is equivalent and it is a significant improvement even using just |
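For readers who want the rule spelled out: XML 1.0 (section 2.11) requires that the two-character sequence `\r\n` and any `\r` not followed by `\n` each become a single `\n`. A minimal character-by-character sketch of that rule follows. The function name `normalize_line_endings_simple` is hypothetical and this is not the code from the PR or from master, only an illustration of the behavior under discussion.

```rust
/// Hypothetical helper (not the project's actual code): normalizes line
/// endings as described in XML 1.0, section 2.11.
fn normalize_line_endings_simple(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' {
            // Both "\r\n" and a lone "\r" collapse into a single "\n".
            if chars.peek() == Some(&'\n') {
                chars.next();
            }
            out.push('\n');
        } else {
            // Every other character, including a bare "\n", is copied as-is.
            out.push(c);
        }
    }
    out
}
```

Because only `\r` and `\n` are ever inspected, nothing in this rule depends on what the other characters are.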
So I added
<svg>
<style><![CDATA[some
🐉
👃
]]></style>
</svg>
as a test case which requires line endings to be normalized, and it behaves exactly the same on master as here. |
I fear stating something well-known here, but the underlying invariant why |
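The comment above is cut off, but the invariant it most likely refers to is a standard property of UTF-8: every byte of a multi-byte sequence has its high bit set (value `0x80` or above), so an ASCII byte such as `\r` or `\n` can never appear inside an encoded non-ASCII character. A small sketch that checks this follows. Both helper names are hypothetical and exist only for this illustration.

```rust
/// Hypothetical check: true if every UTF-8 byte of `c` is outside the ASCII
/// range, i.e. none of them could be mistaken for b'\r' or b'\n'.
fn bytes_are_non_ascii(c: char) -> bool {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).bytes().all(|b| b >= 0x80)
}

/// Hypothetical helper: finds the first carriage return by scanning bytes.
/// The returned offset is always a valid `char` boundary in `text`.
fn find_cr(text: &str) -> Option<usize> {
    let pos = text.bytes().position(|b| b == b'\r')?;
    debug_assert!(text.is_char_boundary(pos));
    Some(pos)
}
```

This is why a byte-oriented search for line endings can slice the original `&str` at the positions it finds without splitting a code point.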
Got it. Agree. As long as we have a couple more tests and For a nicer history, I would suggest having two commits:
|
Will make it a two-step process, first without I did have a look at the existing CDATA tests and, besides the one missing case of multi-byte characters before or after line breaks, which I already added, I don't see any glaring omissions. Did you have some specific test case in mind? |
So I added a simplified bytewise implementation with two additional test cases and then replaced it with the blockwise-copying one, without any changes in the output. |
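The blockwise idea can be sketched with the standard library alone: search for the next `\r`, copy everything before it in one `push_str`, emit `\n`, and skip a directly following `\n`. The function name below is hypothetical and the search uses `Iterator::position`. Treating this as the place where the memchr crate's `memchr` function would be substituted is an assumption on my part, not the PR's actual code.

```rust
/// Hypothetical sketch of blockwise line-ending normalization using only std.
fn normalize_line_endings_blockwise(text: &str) -> String {
    let bytes = text.as_bytes();
    let mut out = String::with_capacity(text.len());
    let mut start = 0;
    // Assumption: a faster byte search (e.g. memchr) could replace `position`.
    while let Some(offset) = bytes[start..].iter().position(|&b| b == b'\r') {
        let cr = start + offset;
        // Copy the whole block before the carriage return in one go.
        out.push_str(&text[start..cr]);
        out.push('\n');
        // Skip the '\r' and a directly following '\n', if there is one.
        start = cr + 1;
        if bytes.get(start) == Some(&b'\n') {
            start += 1;
        }
    }
    // Copy whatever remains after the last carriage return.
    out.push_str(&text[start..]);
    out
}
```

Slicing `text` at these offsets is safe because `\r` and `\n` are single ASCII bytes, so every block boundary falls on a `char` boundary.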
Hhhmmm, so I had a look at the source of the standard library, cf. https://doc.rust-lang.org/stable/src/core/str/pattern.rs.html#439, and while If I had to go to a lonely island where survival would depend on writing a fast parser and I could only take one crate, it would certainly be |
… compared to the standard library.
A single multi-byte test would be enough. As long as you think our tests cover all cases - it's fine.
😄 |
Ok, I did add a new multi-byte test and another test with different combinations of |
memchr should be a well-known quantity as a dependency and provides ample optimization opportunities throughout the parser. For example, adjusting the normalization of line endings to use yields the following improvement for processing 1 MB of random CDATA:
This would also supersede #128