Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix realtime get of nested fields with synthetic source #119575

Merged
merged 6 commits into from
Jan 16, 2025

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Jan 6, 2025

Today, for get-from-translog operations, we only need to reindex the root document into an in-memory Lucene, as the _source is stored in the root document and is sufficient. However, synthesizing the source for nested fields requires both the root document and its child documents. This causes realtime-get operations (as well as update and update-by-query operations) to miss nested fields.

Another issue is that the translog operation is reindexed lazily during get-from-translog operations. As a result, two realtime-get operations can return slightly different outputs: one reading from the translog and the other from Lucene.

This change resolves both issues. However, addressing the second issue can degrade the performance of realtime-get and update operations. If slight inconsistencies are acceptable, the translog operation should be reindexed lazily instead.

Closes #119553

@dnhatn dnhatn force-pushed the realtime-get-synthetic-source branch 2 times, most recently from 10bd8c0 to 979a327 Compare January 6, 2025 17:57
@dnhatn dnhatn force-pushed the realtime-get-synthetic-source branch from 979a327 to 775663e Compare January 6, 2025 20:06
@dnhatn dnhatn added >bug v8.18.0 v8.17.1 :StorageEngine/Mapping The storage related side of mappings labels Jan 6, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn added the auto-backport Automatically create backport pull requests when merged label Jan 6, 2025
@dnhatn dnhatn marked this pull request as ready for review January 6, 2025 20:25
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nhat, left a few clarifying questions.

Comment on lines +199 to +200
numDocs = parsedDocs.docs().size();
writer.addDocuments(parsedDocs.docs());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure I understand why this is necessary for synthetic source and not for regular source, perhaps you can help me out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synthesizing the source of nested fields requires both the root document and its child documents, whereas the stored source only requires the root document:

collectChildren(parentDoc, parentDocs, childScorer.iterator());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I wonder about the field fetch support that is in the GET API, but I suppose it is restricted to not be able to access nested objects in any way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about the field fetch support that is in the GET API, but I suppose it is restricted to not be able to access nested objects in any way?

I don't think including stored fields of nested (child) documents works today. Looks like the root docid is used in ShardGetService#innerGetFetch(...).

Comment on lines +97 to +98
// When using synthetic source, the translog operation must always be reindexed into an in-memory Lucene to ensure consistent
// output for realtime-get operations. However, this can degrade the performance of realtime-get and update operations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inconsistencies would those purely be white-space and field order or also actual content like order in arrays etc?

I think this is ok, synthetic source and heavy real-time GET / update use cases are probably rare, but I'd like to understand the tradeoff more deeply.

If it is only white-space and field order, could we sort that instead and how difficult would that be? Not asking to implement that now, more about knowing our backdoor in case we need this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the inconsistencies also include removing duplicate values in an array and returning a single value as an object instead of an array. For example:

{ 
	"f": ["v"]
}

becomes

{
	"f": "v"
}

Or:

{
	"f": ["v1", "v1", "v2"]
}

becomes

{
	"f": ["v1", "v2"]
}

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@dnhatn
Copy link
Member Author

dnhatn commented Jan 8, 2025

@henningandersen Thank you for reviewing. I wanted to double-check if you're comfortable with the trade-off decision in this PR. I think the alternative approach - trading slight inconsistency for better performance - is also acceptable. While the implementation for that approach is slightly more complex than this PR, I'm happy to make changes if you think it's the better trade-off.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +199 to +200
numDocs = parsedDocs.docs().size();
writer.addDocuments(parsedDocs.docs());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about the field fetch support that is in the GET API, but I suppose it is restricted to not be able to access nested objects in any way?

I don't think including stored fields of nested (child) documents works today. Looks like the root docid is used in ShardGetService#innerGetFetch(...).

@dnhatn dnhatn removed the v8.17.2 label Jan 15, 2025
@dnhatn dnhatn merged commit 7b8f545 into elastic:main Jan 16, 2025
15 of 16 checks passed
@dnhatn dnhatn deleted the realtime-get-synthetic-source branch January 16, 2025 00:18
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 16, 2025
Today, for get-from-translog operations, we only need to reindex the
root document into an in-memory Lucene, as the _source is stored in the
root document and is sufficient. However, synthesizing the source for
nested fields requires both the root document and its child documents.
This causes realtime-get operations (as well as update and
update-by-query operations) to miss nested fields.

Another issue is that the translog operation is reindexed lazily during
get-from-translog operations. As a result, two realtime-get operations
can return slightly different outputs: one reading from the translog and
the other from Lucene.

This change resolves both issues. However, addressing the second issue
can degrade the performance of realtime-get and update operations. If
slight inconsistencies are acceptable, the translog operation should be
reindexed lazily instead.

Closes elastic#119553
dnhatn added a commit that referenced this pull request Jan 16, 2025
…20247)

Today, for get-from-translog operations, we only need to reindex the
root document into an in-memory Lucene, as the _source is stored in the
root document and is sufficient. However, synthesizing the source for
nested fields requires both the root document and its child documents.
This causes realtime-get operations (as well as update and
update-by-query operations) to miss nested fields.

Another issue is that the translog operation is reindexed lazily during
get-from-translog operations. As a result, two realtime-get operations
can return slightly different outputs: one reading from the translog and
the other from Lucene.

This change resolves both issues. However, addressing the second issue
can degrade the performance of realtime-get and update operations. If
slight inconsistencies are acceptable, the translog operation should be
reindexed lazily instead.

Closes #119553
@elastic elastic deleted a comment from elasticsearchmachine Jan 16, 2025
@dnhatn
Copy link
Member Author

dnhatn commented Jan 16, 2025

Backported to 8.18 in #120247

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] LuceneSyntheticSourceChangesSnapshotTests testSkipNonRootOfNestedDocuments failing
4 participants