Add HashSet based filtering optimization to XContentMapValues #17160

hye-on · 2025-01-28T13:56:13Z

Description

This optimization speeds up document filtering when field names are simple, meaning there are no dots or wildcards in the filter field names and no dots in the document keys. In such cases, filtering uses a HashSet-based implementation instead of automaton matching, which prevents TooComplexToDeterminizeException when processing documents with numerous long field names.

Changes:

- Add HashSet optimization for simple field names
- Split filter implementation into set-based and automaton-based
- Add helper methods to check field name patterns
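
For readers who want a concrete picture of the set-based path, here is a minimal sketch of the idea described above. It is not the code from this PR: the class and method names (`SetFilterSketch`, `isSimple`, `canUseSetFilter`, `filterWithSets`) are hypothetical, only top-level keys are handled, and the automaton fallback is represented by a printed message.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names are illustrative and not taken from XContentMapValues.
final class SetFilterSketch {

    // A name is "simple" when it has no dot (sub-object path) and no wildcard.
    static boolean isSimple(String name) {
        return name.indexOf('.') < 0 && name.indexOf('*') < 0;
    }

    // The set-based path applies only when every include and exclude is simple
    // and no top-level document key contains a dot.
    static boolean canUseSetFilter(Map<String, ?> doc, List<String> includes, List<String> excludes) {
        return includes.stream().allMatch(SetFilterSketch::isSimple)
            && excludes.stream().allMatch(SetFilterSketch::isSimple)
            && doc.keySet().stream().noneMatch(k -> k.indexOf('.') >= 0);
    }

    // Exact-match filtering. Assumption: an empty include list means "keep everything",
    // and excludes always win over includes.
    static Map<String, Object> filterWithSets(Map<String, ?> doc, List<String> includes, List<String> excludes) {
        Set<String> in = new HashSet<>(includes);
        Set<String> ex = new HashSet<>(excludes);
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, ?> e : doc.entrySet()) {
            String key = e.getKey();
            if ((in.isEmpty() || in.contains(key)) && !ex.contains(key)) {
                out.put(key, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", "t");
        doc.put("body", "b");
        doc.put("tags", List.of("x"));
        List<String> includes = List.of("title", "tags");
        List<String> excludes = List.of("tags");
        if (canUseSetFilter(doc, includes, excludes)) {
            System.out.println(filterWithSets(doc, includes, excludes));
        } else {
            System.out.println("fall back to automaton matching");
        }
    }
}
```

The point of the sketch is that each key costs one hash lookup per set, so no automaton has to be built or determinized for the simple case.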

Reason for Checking Dots in Document Keys

Dots in field names are treated as sub-objects: if a document contains a.b as a property and a is an include, then a.b is kept in the filtered map (see the existing JavaDoc below). For that reason, we check document keys for dots to decide whether to use HashSet-based filtering or fall back to automaton matching.

JavaDoc

* <p>
* Dots in field names are treated as sub objects. So for instance if a
* document contains {@code a.b} as a property and {@code a} is an include,
* then {@code a.b} will be kept in the filtered map.
*/
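
To illustrate why a dotted document key rules out a plain set lookup, the sketch below contrasts an exact lookup with a simplified path-aware match. The names (`DottedKeySketch`, `exactMatch`, `pathAwareMatch`, `hasDottedKey`) are hypothetical, and `pathAwareMatch` is only a stand-in for the documented behavior, not the automaton logic in XContentMapValues.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the implementation from this PR.
final class DottedKeySketch {

    // Exact lookup, as a plain HashSet-based filter would do.
    static boolean exactMatch(Set<String> includes, String key) {
        return includes.contains(key);
    }

    // Simplified stand-in for the documented semantics: a key matches when it
    // equals an include or sits under it as a dotted sub-path.
    static boolean pathAwareMatch(Set<String> includes, String key) {
        for (String include : includes) {
            if (key.equals(include) || key.startsWith(include + ".")) {
                return true;
            }
        }
        return false;
    }

    // Guard described in the PR text: any dotted key sends the document
    // to the automaton-based path instead of the set-based one.
    static boolean hasDottedKey(Map<String, ?> doc) {
        return doc.keySet().stream().anyMatch(k -> k.indexOf('.') >= 0);
    }

    public static void main(String[] args) {
        Set<String> includes = Set.of("a");
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("a.b", 1);

        System.out.println("exact lookup keeps a.b: " + exactMatch(includes, "a.b"));
        System.out.println("path-aware match keeps a.b: " + pathAwareMatch(includes, "a.b"));
        System.out.println("document needs fallback: " + hasDottedKey(doc));
    }
}
```

With an include of `a` and a key of `a.b`, the exact lookup misses the key even though the JavaDoc says it should be kept, which is the case the dot check guards against.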

Note on Testing

Thank you for taking the time to review this PR.
This change focuses on internal optimization, and I believe the existing tests already sufficiently cover the functional scenarios, so I did not add new tests. I considered including tests to verify the implementation selection (HashSet vs. automaton), but I felt such tests might be too closely tied to implementation details, potentially making them fragile. If you believe additional tests are necessary, I would greatly appreciate your feedback.

Comment

I made every effort to ensure the changes align with the style and intent of the existing codebase. If you notice any areas, even minor ones, that could benefit from revision, I would be happy to incorporate your suggestions.
Thank you again for your time and guidance! 😀

Related Issues

Resolves #17114

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


github-actions · 2025-01-28T14:58:47Z

✅ Gradle check result for 9c60c7e: SUCCESS

codecov · 2025-01-28T14:59:13Z

Codecov Report

Attention: Patch coverage is 87.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.31%. Comparing base (6fb0c1b) to head (8e14d3b).

Files with missing lines	Patch %	Lines
...rch/common/xcontent/support/XContentMapValues.java	87.50%	0 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #17160      +/-   ##
============================================
+ Coverage     72.24%   72.31%   +0.07%     
- Complexity    65704    65711       +7     
============================================
  Files          5318     5318              
  Lines        305674   305698      +24     
  Branches      44349    44355       +6     
============================================
+ Hits         220834   221075     +241     
+ Misses        66769    66469     -300     
- Partials      18071    18154      +83



github-actions · 2025-01-29T14:42:11Z

✅ Gradle check result for 97145a0: SUCCESS

msfroh

Thanks a lot @hye-on ! This looks great.

I checked the code coverage report and the new code is well covered by the existing tests, which is nice.

We've missed the code freeze date for 2.19, so I think this will ship with 3.0, which we're currently planning as the next release.

Can you please add an entry to the CHANGELOG-3.0.md file with an end-user-friendly description of the change? Maybe something like "Use simpler matching logic for source fields when explicit field names (no wildcards or dot-paths) are specified".


hye-on · 2025-01-31T15:27:34Z

@msfroh Sorry for the late response! Thanks for the suggested changelog entry—I’ve added it!

github-actions · 2025-01-31T16:18:07Z

❌ Gradle check result for dd9749e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions · 2025-01-31T23:12:36Z

✅ Gradle check result for 8e14d3b: SUCCESS

github-actions bot added the bug and good first issue labels Jan 28, 2025

msfroh reviewed Jan 28, 2025

server/src/main/java/org/opensearch/common/xcontent/support/XContentMapValues.java (outdated)

Update filtering to support HashSet-based approach for map keys with dots

97145a0

Signed-off-by: hye-on <[email protected]>

msfroh reviewed Jan 30, 2025


msfroh added the v3.0.0 label Jan 30, 2025

Add changelog entry for improved source field matching logic

dd9749e

Signed-off-by: hye-on <[email protected]>

This was referenced Jan 31, 2025

[AUTOCUT] Gradle Check Flaky Test Report for AzureBlobStoreRepositoryTests #14291

Open

[AUTOCUT] Gradle Check Flaky Test Report for RemoteRestoreSnapshotIT #14324

Open

Merge branch 'main' into source-field-optimization

8e14d3b

Signed-off-by: Michael Froh <[email protected]>
