Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Engine: Regex: Avoid compiling and storing the same regex multiple times #98

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

arilou
Copy link

@arilou arilou commented Dec 19, 2024

Previously, the engine was compiling the same regex multiple times during execution. This redundant compilation process led to unnecessary performance overhead and increased resource consumption.

With this change, we store the compiled regex in a HashSet. If the same regex is used more than once, the engine reuses the previously compiled instance.

Previously, the engine was compiling the same regex multiple times
during execution. This redundant compilation process led to unnecessary
performance overhead and increased resource consumption.

With this change, we store the compiled regex in a `HashSet`. If the
same regex is used more than once, the engine reuses the previously
compiled instance.

Signed-off-by: Jon Doron <[email protected]>
@marmeladema
Copy link
Collaborator

marmeladema commented Dec 19, 2024

Hello and thank you for your proposal.

Unfortunately, we aren't going to merge this approach because we would prefer consumers of wirefilter to be in control of the cache they are using.

But the use case your perfectly reasonable :) We have been working internally on a huge refactor which would only parse regex patterns using regex-syntax during filter parsing. Regex would then be compiled during filter compilation which is customizable using the Compiler trait. This is our current plan but we don't have an ETA at the moment.
It requires some preliminary steps:

  1. Storing spans for each AST node somehow (we have that part 90% done)
  2. Make the filter compilation fallible, so that regex compilation can return an error using the span of the regex pattern
  3. Improve the compiler trait to support customization of the regex object

@arilou
Copy link
Author

arilou commented Dec 19, 2024

Thank you so much for the fast reply! I started going over all the changes in this last sync, I noticed you added wildcard and wildcard strict, is there perhaps some document that summarizes some of the major changes?

As for this PR, I understand and looking forward to work with what you have described. I was wondering if in the mean while if ill put this change under a feature flag will help?
Something along the lines of this diff:
master...arilou:wirefilter:new_wiz

There are are few other changes we were hoping to present to you, we will be working on them in the coming week and send you a PR to review.

@marmeladema
Copy link
Collaborator

I was wondering if in the mean while if ill put this change under a feature flag will help?

I am very sorry about my answer, but no, we aren't going to merge changes that we aren't going to support and use internally.

There are are few other changes we were hoping to present to you, we will be working on them in the coming week and send you a PR to review.

Probably best to open issues about what you want/need first so that we can discuss ahead of time what is the best way forward.

@RReverser
Copy link
Contributor

Previously, the engine was compiling the same regex multiple times during execution.

I don't think it does. As long as you store and reuse the Filter instance, the compiled regex should already be cached as part of it.

@marmeladema
Copy link
Collaborator

That's true but it's fairly common to use the same regex in different and unrelated filters.

@RReverser
Copy link
Contributor

Ah yeah, in those scenarios caching is definitely up to the user.

@arilou
Copy link
Author

arilou commented Dec 19, 2024

In our use cases we have many different rules each one is a filter, but from your conversation it sounds like I might be using Wirefilter wrong.

Is not it 1 rule to 1 filter? I mean one could do an OR between all the different conditions to build a single giant filter, but then how would you know which rules were triggered, and the moment a single "sub-rule" is triggered the evaluation would stop.

Am I missing something in the way I currently work with Wirefilter?

@arilou
Copy link
Author

arilou commented Dec 19, 2024

Regardless a small issue I ran into few days ago is the signature of get_field_value, which uses the wrong lifetime
(Not sending a PR since you guys are working internally on your git anyway) but the change I'm suggesting is:

diff --git a/engine/src/execution_context.rs b/engine/src/execution_context.rs
index 32a1907..a226cbe 100644
--- a/engine/src/execution_context.rs
+++ b/engine/src/execution_context.rs
@@ -132,7 +132,7 @@ impl<'e, U> ExecutionContext<'e, U> {
     }

     /// Get the value of a field.
-    pub fn get_field_value<'s>(&self, field: Field<'s>) -> Option<&LhsValue<'_>> {
+    pub fn get_field_value<'s>(&self, field: Field<'s>) -> Option<&LhsValue<'e>> {
         assert!(self.scheme() == field.scheme());

         self.values[field.index()].as_ref()

This allows doing stuff like "copying" a field to another field

@arilou
Copy link
Author

arilou commented Jan 2, 2025

We ran into some performance issue in our end to end tests with the new wirefilter, I was wondering if it would be possible for you guys to publish a branch with all the git commits rather than a single squashed one. so we can run a bisec easier to try and spot what is causing the issue.

Thanks and happy new year!

@marmeladema
Copy link
Collaborator

Unfortunately no, it's too tedious to analyze each commit and verify what information is contained in the commit and if it's publicly shareable.

Alternatively, if you can reproduce the problem with some standalone code / script etc, we can probably run it internally and try to figure out the problem.

That being said, if it's related to regexes, Wirefilter is probably not the problem since we are just a very simple consumer of it. The regex crate has been substantially rewritten in the last year and a half and we have found some performance regressions with it but never had time to reproduce in isolation in publicly shareable test cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants