[Enhancement] Reduce the size of data written to edit log when schema changes in shared-data clusters with fast schema evolution enabled. #55282
…ared-data clusters with fast schema evolution enabled. Signed-off-by: zjial <[email protected]>
[Java-Extensions Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass : 13 / 13 (100.00%)
[BE Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
public void write(DataOutput out) throws IOException {
    String json;

    if (!hasWriteSchema) {
        hasWriteSchema = true;
        json = GsonUtils.GSON.toJson(this);
    } else {
        List<IndexSchemaInfo> tmpSchemaInfos = this.schemaInfos;
        this.schemaInfos = null;
        json = GsonUtils.GSON.toJson(this);
        this.schemaInfos = tmpSchemaInfos;
    }
This is too hacky and hard to maintain.
I think you could add a new edit log type to solve this problem instead.
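One way to read the suggestion above is to log a full record once and a lightweight record for later state changes, so the schema never has to be nulled out during serialization. The sketch below is only an illustration of that idea: every class name, field, and state string (`FullJobLog`, `JobStateLog`, `JobReplaySketch`, `"PENDING"`, `"FINISHED"`) is hypothetical and not taken from the StarRocks code base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; none of these are StarRocks classes.

// Full record: written once, when the job is first logged.
final class FullJobLog {
    final long jobId;
    final String state;
    final List<String> schemaInfos;

    FullJobLog(long jobId, String state, List<String> schemaInfos) {
        this.jobId = jobId;
        this.state = state;
        this.schemaInfos = new ArrayList<>(schemaInfos);
    }
}

// Lightweight record: written on later state changes, carries no schema.
final class JobStateLog {
    final long jobId;
    final String state;

    JobStateLog(long jobId, String state) {
        this.jobId = jobId;
        this.state = state;
    }
}

// Replay side: the lightweight record updates a job created by the full record.
final class JobReplaySketch {
    static final class Job {
        String state;
        List<String> schemaInfos;
    }

    private final Map<Long, Job> jobs = new HashMap<>();

    void replay(FullJobLog log) {
        Job job = new Job();
        job.state = log.state;
        job.schemaInfos = log.schemaInfos;
        jobs.put(log.jobId, job);
    }

    void replay(JobStateLog log) {
        Job job = jobs.get(log.jobId);
        if (job == null) {
            throw new IllegalStateException("no full record replayed for job " + log.jobId);
        }
        job.state = log.state; // schemaInfos is left untouched
    }

    Job get(long jobId) {
        return jobs.get(jobId);
    }
}
```

The trade-off in this sketch is that replay must tolerate two record shapes, and a lightweight record is only meaningful after the matching full record has been replayed.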
Thank you for your reply. It seems that writing the edit log ultimately calls the class's write method for serialization. Without overriding the write method, can this problem be solved by adding an edit log type? Maybe we could make the temporary modification of the object safe by using try-catch?
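If the field-swapping approach is kept, restoring the field in a `finally` block (rather than only catching exceptions) would guarantee that it comes back even when serialization throws. The following is a minimal sketch under stated assumptions: `SchemaJobSketch`, `render`, and `toLogPayload` are hypothetical stand-ins, and the pluggable serializer only imitates the role of `GsonUtils.GSON.toJson(this)` in the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the job class in the diff; not the StarRocks class.
final class SchemaJobSketch {
    private List<String> schemaInfos;
    private boolean hasWrittenSchema = false;

    SchemaJobSketch(List<String> schemaInfos) {
        this.schemaInfos = new ArrayList<>(schemaInfos);
    }

    // JSON-like stand-in for a real serializer; skips schemaInfos when it is null.
    static String render(SchemaJobSketch job) {
        StringBuilder sb = new StringBuilder("{\"jobId\":1");
        if (job.schemaInfos != null) {
            sb.append(",\"schemaInfos\":").append(job.schemaInfos);
        }
        return sb.append('}').toString();
    }

    String toLogPayload(Function<SchemaJobSketch, String> serializer) {
        if (!hasWrittenSchema) {
            String json = serializer.apply(this);
            hasWrittenSchema = true; // set only after the full write succeeded
            return json;
        }
        List<String> saved = schemaInfos;
        schemaInfos = null; // hide the schema for this write only
        try {
            return serializer.apply(this);
        } finally {
            schemaInfos = saved; // restored even if the serializer throws
        }
    }

    List<String> getSchemaInfos() {
        return schemaInfos;
    }
}
```

Note that this sketch sets the flag after the first successful write, which differs from the diff, where the flag is set before serialization. It also does nothing about concurrent readers, which could still observe the field as null while a write is in progress.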
How much can this PR reduce the execution time?
The background is that our character encoding configuration causes table column comments to bloat. As a result, while a schema change task runs, writing the schema information multiple times substantially increases the execution time. By reducing the number of times the schema information is written, we can significantly decrease the execution time. For tables with simple structures, however, the improvement is not noticeable.
Why I'm doing:
When columns are added or dropped, the schema change job in shared-data clusters with fast schema evolution enabled is written to the edit log. In total, there are four write operations, and each one writes the same `schemaInfos`. This introduces additional overhead for wide tables with a large number of column comments.

What I'm doing:
Write schemaInfos only once.
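As a rough illustration of the saving described above, the payload written across the four operations can be estimated from two sizes: the serialized schema and everything else in the job. The helper below is a hypothetical back-of-the-envelope sketch; `EditLogSizeSketch` and `totalBytes` are made-up names, and the byte counts in the usage note are not measurements.

```java
// Hypothetical helper for a rough estimate; all sizes are in bytes.
final class EditLogSizeSketch {
    // writes: number of edit log writes for one job (four in the description above)
    // baseBytes: serialized job size without schemaInfos
    // schemaBytes: serialized size of schemaInfos
    // schemaOnce: true if schemaInfos is included in the first write only
    static long totalBytes(int writes, long baseBytes, long schemaBytes, boolean schemaOnce) {
        if (writes <= 0) {
            return 0;
        }
        long schemaWrites = schemaOnce ? 1 : writes;
        return writes * baseBytes + schemaWrites * schemaBytes;
    }
}
```

For example, with made-up sizes of 200 bytes for the job and 10,000 bytes for the schema, `totalBytes(4, 200, 10_000, false)` gives 40,800 and `totalBytes(4, 200, 10_000, true)` gives 10,800. With four writes the difference is always three times the schema size, which is why the gain is small for tables with simple schemas.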