Fix realtime get of nested fields with synthetic source #119575
Hi @dnhatn, I've created a changelog YAML for you.
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Thanks Nhat, left a few clarifying questions.
numDocs = parsedDocs.docs().size();
writer.addDocuments(parsedDocs.docs());
I am not 100% sure I understand why this is necessary for synthetic source and not for regular source, perhaps you can help me out?
Synthesizing the source of nested fields requires both the root document and its child documents, whereas the stored source only requires the root document:
elasticsearch/server/src/main/java/org/elasticsearch/index/mapper/NestedObjectMapper.java
Line 466 in 12e86b1
collectChildren(parentDoc, parentDocs, childScorer.iterator());
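To illustrate why the child documents have to be present, here is a minimal sketch of the Lucene-style block layout, in which the nested (child) documents are indexed immediately before their root document. It does not use Lucene; the class `NestedBlockSketch` and the method `childrenOf` are hypothetical names for this illustration, and a `java.util.BitSet` stands in for the set of root doc IDs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of a nested block: children are stored right before their root document. */
final class NestedBlockSketch {

    /** Returns the doc IDs of the children that belong to the root document {@code rootDoc}. */
    static List<Integer> childrenOf(BitSet rootDocs, int rootDoc) {
        if (!rootDocs.get(rootDoc)) {
            throw new IllegalArgumentException("doc " + rootDoc + " is not a root document");
        }
        // The block starts right after the previous root document, or at doc 0 if there is none
        // (previousSetBit returns -1 in that case).
        int previousRoot = rootDocs.previousSetBit(rootDoc - 1);
        List<Integer> children = new ArrayList<>();
        for (int doc = previousRoot + 1; doc < rootDoc; doc++) {
            children.add(doc);
        }
        return children;
    }

    public static void main(String[] args) {
        BitSet rootDocs = new BitSet();
        rootDocs.set(2); // docs 0 and 1 are children of root 2
        rootDocs.set(3); // root 3 has no children
        rootDocs.set(6); // docs 4 and 5 are children of root 6
        System.out.println(childrenOf(rootDocs, 2)); // [0, 1]
        System.out.println(childrenOf(rootDocs, 3)); // []
        System.out.println(childrenOf(rootDocs, 6)); // [4, 5]
    }
}
```

If only the root document were written to the in-memory index, the range scanned for children would be empty, which matches the symptom described in this PR: nested fields missing from the synthesized source.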
Thanks. I wonder about the field fetch support that is in the GET API, but I suppose it is restricted to not be able to access nested objects in any way?
> I wonder about the field fetch support that is in the GET API, but I suppose it is restricted to not be able to access nested objects in any way?
I don't think including stored fields of nested (child) documents works today. It looks like the root docid is used in ShardGetService#innerGetFetch(...).
// When using synthetic source, the translog operation must always be reindexed into an in-memory Lucene to ensure consistent
// output for realtime-get operations. However, this can degrade the performance of realtime-get and update operations.
Would the inconsistencies be purely whitespace and field order, or would they also affect actual content, such as the order of values in arrays?
I think this is OK, since synthetic source combined with heavy realtime GET / update use cases is probably rare, but I'd like to understand the tradeoff more deeply.
If it is only white-space and field order, could we sort that instead and how difficult would that be? Not asking to implement that now, more about knowing our backdoor in case we need this.
Yes, the inconsistencies also include removing duplicate values in an array and returning a single-element array as a bare value instead of an array. For example:
{
"f": ["v"]
}
becomes
{
"f": "v"
}
Or:
{
"f": ["v1", "v1", "v2"]
}
becomes
{
"f": ["v1", "v2"]
}
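The two transformations above can be sketched as a small function. This is only an illustration of the observable effect, not the Elasticsearch implementation; `SyntheticArraySketch` and `normalize` are hypothetical names, and keeping first-seen order when removing duplicates is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Illustrates how an array value can change shape when the source is synthesized. */
final class SyntheticArraySketch {

    /**
     * Removes duplicate values and unwraps a single remaining value.
     * Returns a String when exactly one distinct value is left, otherwise a List of Strings.
     */
    static Object normalize(List<String> values) {
        // LinkedHashSet drops duplicates while keeping first-seen order (an assumption of this sketch).
        List<String> distinct = new ArrayList<>(new LinkedHashSet<>(values));
        if (distinct.size() == 1) {
            return distinct.get(0);
        }
        return distinct;
    }

    public static void main(String[] args) {
        System.out.println(normalize(List.of("v")));              // v
        System.out.println(normalize(List.of("v1", "v1", "v2"))); // [v1, v2]
    }
}
```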
LGTM.
@henningandersen Thank you for reviewing. I wanted to double-check if you're comfortable with the trade-off decision in this PR. I think the alternative approach - trading slight inconsistency for better performance - is also acceptable. While the implementation for that approach is slightly more complex than this PR, I'm happy to make changes if you think it's the better trade-off.
LGTM
Backported to 8.18 in #120247
Today, for get-from-translog operations, we only need to reindex the root document into an in-memory Lucene, as the _source is stored in the root document and is sufficient. However, synthesizing the source for nested fields requires both the root document and its child documents. This causes realtime-get operations (as well as update and update-by-query operations) to miss nested fields.
Another issue is that the translog operation is reindexed lazily during get-from-translog operations. As a result, two realtime-get operations can return slightly different outputs: one reading from the translog and the other from Lucene.
This change resolves both issues. However, addressing the second issue can degrade the performance of realtime-get and update operations. If slight inconsistencies are acceptable, the translog operation should be reindexed lazily instead.
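The trade-off can be pictured with a deliberately simplified model. It assumes that a lazy read may return the original source straight from the translog until a refresh makes the document visible in Lucene, while an eager read always goes through synthesis. `RealtimeGetSketch`, `synthesize`, and the `alwaysReindex` flag are hypothetical names for this sketch and do not correspond to Elasticsearch classes or settings.

```java
import java.util.List;

/** Toy model contrasting lazy and eager reindexing of a translog operation. */
final class RealtimeGetSketch {

    /** Stand-in for source synthesis: unwraps a single-element array (see the review discussion). */
    static Object synthesize(List<String> original) {
        return original.size() == 1 ? original.get(0) : original;
    }

    private final List<String> translogSource; // field value as originally indexed
    private final boolean alwaysReindex;       // true models the behavior chosen in this PR
    private boolean refreshed = false;

    RealtimeGetSketch(List<String> translogSource, boolean alwaysReindex) {
        this.translogSource = translogSource;
        this.alwaysReindex = alwaysReindex;
    }

    /** After a refresh, reads are served from Lucene and therefore synthesized. */
    void refresh() {
        refreshed = true;
    }

    Object get() {
        if (refreshed || alwaysReindex) {
            return synthesize(translogSource);
        }
        // Lazy path (assumed): the original source is returned as-is from the translog.
        return translogSource;
    }

    public static void main(String[] args) {
        RealtimeGetSketch lazy = new RealtimeGetSketch(List.of("v"), false);
        System.out.println(lazy.get()); // [v]
        lazy.refresh();
        System.out.println(lazy.get()); // v

        RealtimeGetSketch eager = new RealtimeGetSketch(List.of("v"), true);
        System.out.println(eager.get()); // v
        eager.refresh();
        System.out.println(eager.get()); // v
    }
}
```

In this model the eager variant gives the same output before and after a refresh, at the cost of doing the synthesis work on every realtime read.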
Closes #119553