[Enhancement] Reduce the size of data written to edit log when schema changes in shared-data clusters with fast schema evolution enabled. #55282

Open
wants to merge 1 commit into
base: main

@zjial zjial commented Jan 21, 2025

Why I'm doing:

When columns are added or dropped in a shared-data cluster with fast schema evolution enabled, the schema change job is written to the edit log. There are four write operations in total, and each one writes the same schemaInfos. This adds overhead for wide tables with a large number of column comments.

What I'm doing:

Write schemaInfos only once.
Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@CLAassistant

CLAassistant commented Jan 21, 2025

CLA assistant check
All committers have signed the CLA.

@mergify mergify bot assigned zjial Jan 21, 2025
@zjial zjial marked this pull request as ready for review January 21, 2025 09:34
@zjial zjial requested a review from a team as a code owner January 21, 2025 09:34
@zjial zjial changed the title Reduce the size of data written to edit log when schema changes in shared-data clusters with fast schema evolution enabled. [Enhancement] Reduce the size of data written to edit log when schema changes in shared-data clusters with fast schema evolution enabled. Jan 21, 2025
@github-actions github-actions bot added 3.4 and removed 3.4 labels Feb 8, 2025
…ared-data clusters with fast schema evolution enabled.

Signed-off-by: zjial <[email protected]>

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

[FE Incremental Coverage Report]

pass : 13 / 13 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| 🔵 com/starrocks/alter/LakeTableAsyncFastSchemaChangeJob.java | 13 | 13 | 100.00% | [] |

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

Comment on lines +157 to +168
    public void write(DataOutput out) throws IOException {
        String json;

        if (!hasWriteSchema) {
            hasWriteSchema = true;
            json = GsonUtils.GSON.toJson(this);
        } else {
            List<IndexSchemaInfo> tmpSchemaInfos = this.schemaInfos;
            this.schemaInfos = null;
            json = GsonUtils.GSON.toJson(this);
            this.schemaInfos = tmpSchemaInfos;
        }
Contributor

This is too hacky and hard to maintain.
I think you could add a new edit log type to solve this problem.
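
One possible reading of this suggestion, as a minimal sketch only: the schema is carried by a single "job created" entry, and later state transitions are logged with a separate, lighter entry type that holds no schema. Every name below (`EditLogSketch`, `JobCreated`, `JobStateChanged`, `Job`, `replay`) is hypothetical and does not come from the StarRocks code base; the real edit log types and replay path may look quite different.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the StarRocks code base.
final class EditLogSketch {
    // Full entry, written once when the job is created; the only entry that carries the schema.
    static final class JobCreated {
        final long jobId;
        final List<String> schemaInfos;

        JobCreated(long jobId, List<String> schemaInfos) {
            this.jobId = jobId;
            this.schemaInfos = schemaInfos;
        }
    }

    // Light entry, written on each later state transition; carries no schema.
    static final class JobStateChanged {
        final long jobId;
        final String newState;

        JobStateChanged(long jobId, String newState) {
            this.jobId = jobId;
            this.newState = newState;
        }
    }

    // In-memory view of a job, rebuilt during replay.
    static final class Job {
        final List<String> schemaInfos;
        String state = "PENDING";

        Job(List<String> schemaInfos) {
            this.schemaInfos = schemaInfos;
        }
    }

    // Replays the journal in order: a full entry creates the job, a light entry only updates its state.
    static Map<Long, Job> replay(List<Object> journal) {
        Map<Long, Job> jobs = new HashMap<>();
        for (Object entry : journal) {
            if (entry instanceof JobCreated) {
                JobCreated created = (JobCreated) entry;
                jobs.put(created.jobId, new Job(created.schemaInfos));
            } else if (entry instanceof JobStateChanged) {
                JobStateChanged changed = (JobStateChanged) entry;
                Job job = jobs.get(changed.jobId);
                if (job != null) {
                    job.state = changed.newState;
                }
            }
        }
        return jobs;
    }
}
```

With this shape the serializer never has to mutate the job object, because what gets written is decided by the entry type rather than by a flag inside `write`.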

Author

Thank you for your reply. It seems that writing the edit log ultimately calls the class's write method for serialization. Without overriding the write method, can this problem be solved by adding an edit log type? Alternatively, maybe we could make the temporary modification of the object safe by using try-catch?
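
The safety concern raised here can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `SchemaJobSketch` is a hypothetical stand-in for the job class, the schema entries are plain strings, and the serializer is passed in as a `Function` so that the example does not depend on Gson. It uses `try`/`finally` rather than `try`/`catch`, so the field is restored whether or not serialization throws.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the job class; not the StarRocks implementation.
final class SchemaJobSketch {
    List<String> schemaInfos;
    boolean hasWriteSchema = false;

    SchemaJobSketch(List<String> schemaInfos) {
        this.schemaInfos = schemaInfos;
    }

    // Serializes the job; only the first call includes schemaInfos.
    String serialize(Function<SchemaJobSketch, String> toJson) {
        if (!hasWriteSchema) {
            hasWriteSchema = true;
            return toJson.apply(this);
        }
        List<String> saved = schemaInfos;
        schemaInfos = null; // hide the schema from the serializer
        try {
            return toJson.apply(this);
        } finally {
            schemaInfos = saved; // runs even if the serializer throws
        }
    }
}
```

This only guards against an exception leaving the field set to null. It does not address the maintainability concern, and it is not thread-safe: another thread reading `schemaInfos` during serialization could still observe null.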

@sevev
Contributor

sevev commented Feb 12, 2025

How much can this PR reduce the execution time?

@zjial
Author

zjial commented Feb 12, 2025

> How much can this PR reduce the execution time?

The background to this problem is that our character encoding configuration bloats the table column comments. As a result, writing the schema information multiple times while a schema change task runs substantially increases its execution time. By reducing the number of times the schema information is written, we can significantly decrease the execution time. For tables with simple structures, however, the improvement is not noticeable.
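
As a rough illustration of why the gain depends on schema size, the sketch below compares the bytes written when every edit log entry carries the schema with the bytes written when only the first one does. The class and method names are hypothetical, the model ignores everything except payload size, and no measurements from the PR are implied.

```java
// Hypothetical back-of-the-envelope model; not taken from the PR.
final class EditLogSizeSketch {
    // Bytes written when each of `writes` entries carries the schema.
    static long everyWrite(long schemaBytes, long metaBytes, int writes) {
        return writes * (schemaBytes + metaBytes);
    }

    // Bytes written when only the first entry carries the schema.
    static long firstWriteOnly(long schemaBytes, long metaBytes, int writes) {
        return schemaBytes + writes * metaBytes;
    }
}
```

With made-up numbers, a 1,000,000-byte schema, 500 bytes of other job metadata, and the four writes mentioned in the description, `everyWrite(1_000_000, 500, 4)` gives 4,002,000 bytes and `firstWriteOnly(1_000_000, 500, 4)` gives 1,002,000 bytes. When the schema is small relative to the other metadata, the two results are close, which is consistent with the remark that simple tables see little improvement.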
