Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HashSet based filtering optimization to XContentMapValues #17160

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

hye-on
Copy link
Contributor

@hye-on hye-on commented Jan 28, 2025

Description

This optimization enhances document filtering when field names are simple (no dots or wildcards in field names and no dots in document keys). In such cases, it uses a HashSet-based implementation instead of automaton matching to prevent TooComplexToDeterminizeException when processing documents with numerous long field names.

Changes:

  • Add HashSet optimization for simple field names
  • Split filter implementation into set-based and automaton-based
  • Add helper methods to check field name patterns

Reason for Checking Dots in Document Keys

Since dots in field names are treated as sub-objects (e.g., if a document contains a.b as a property and a is an include, then a.b will be kept in the filtered map as per existing JavaDoc below), we check document keys for dots to determine whether to use HashSet-based filtering or fall back to automaton matching.

JavaDoc

* <p>
* Dots in field names are treated as sub objects. So for instance if a
* document contains {@code a.b} as a property and {@code a} is an include,
* then {@code a.b} will be kept in the filtered map.
*/

Note on Testing

Thank you for taking the time to review this PR.
This change focuses on internal optimization, and I believe the existing tests already sufficiently cover the functional scenarios, so I did not add new tests. I considered including tests to verify the implementation selection (HashSet vs. automaton), but I felt such tests might be too closely tied to implementation details, potentially making them fragile. If you believe additional tests are necessary, I would greatly appreciate your feedback.

Comment

I made every effort to ensure the changes align with the style and intent of the existing codebase. If you notice any areas, even minor ones, that could benefit from revision, I would be happy to incorporate your suggestions.
Thank you again for your time and guidance! 😀

Related Issues

Resolves #17114

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This optimization enhances document filtering when field names are simple (no dots or wildcards in field names and no dots in document keys). In such cases, it uses a HashSet-based implementation instead of automaton matching to prevent TooComplexToDeterminizeException when processing documents with numerous long field names.

Changes:
- Add HashSet optimization for simple field names
- Split filter implementation into set-based and automaton-based
- Add helper methods to check field name patterns

Signed-off-by: hye-on <[email protected]>
Copy link
Contributor

✅ Gradle check result for 9c60c7e: SUCCESS

Copy link

codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 87.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.31%. Comparing base (6fb0c1b) to head (8e14d3b).

Files with missing lines Patch % Lines
...rch/common/xcontent/support/XContentMapValues.java 87.50% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17160      +/-   ##
============================================
+ Coverage     72.24%   72.31%   +0.07%     
- Complexity    65704    65711       +7     
============================================
  Files          5318     5318              
  Lines        305674   305698      +24     
  Branches      44349    44355       +6     
============================================
+ Hits         220834   221075     +241     
+ Misses        66769    66469     -300     
- Partials      18071    18154      +83     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

✅ Gradle check result for 97145a0: SUCCESS

Copy link
Collaborator

@msfroh msfroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @hye-on ! This looks great.

I checked the code coverage report and the new code is well-covered the existing tests, which is nice.

We've missed the code freeze date for 2.19, so I think this will ship with 3.0, which we're currently planning as the next release.

Can you please add an entry to the CHANGELOG-3.0.md file with an end-user-friendly description of the change? Maybe something like "Use simpler matching logic for source fields when explicit field names (no wildcards or dot-paths) are specified".

@msfroh msfroh added the v3.0.0 Issues and PRs related to version 3.0.0 label Jan 30, 2025
@hye-on
Copy link
Contributor Author

hye-on commented Jan 31, 2025

@msfroh Sorry for the late response! Thanks for the suggested changelog entry—I’ve added it!

Copy link
Contributor

❌ Gradle check result for dd9749e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for 8e14d3b: SUCCESS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers v3.0.0 Issues and PRs related to version 3.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Fetching source uses automata even for simple matching
2 participants