-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] Fix repeated migration of colocate tablets #53099
base: main
Are you sure you want to change the base?
Conversation
if (!olapTbl.needSchedule(isLocalBalance)) { | ||
continue; | ||
} | ||
|
||
for (Partition partition : globalStateMgr.getLocalMetastore().getAllPartitionsIncludeRecycleBin(olapTbl)) { | ||
partitionChecked++; | ||
if (partitionChecked % partitionBatchNum == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
The changes are adding calls to olapTable.needSchedule()
and olapTbl.needSchedule()
, leading to skipped iterations when scheduling is not needed. If needSchedule
logic is incorrect, or if skipping these parts without proper handling causes data imbalance or system performance issues, it may introduce a risk of improperly balanced clusters or missed scheduling opportunities.
You can modify the code like this:
// Assess if the conditions under which 'needSchedule' returns false might lead to critical operations being skipped inadvertently,
// and ensure that either strategic logging or alternative operations handle such cases to preserve cluster health.
private List<TabletSchedCtx> balanceClusterDisk(ClusterLoadStatistic clusterStat) {
for (/* iteration variables */) {
if (!olapTable.needSchedule(false)) {
// Add logging or checks to confirm skipping doesn't cause issues
continue;
}
if (isDestBackendLocationMismatch(olapTable, hBackend.getId(), lBackend.getId(),
physicalPartition.getParentId(), tabletId)) {
continue;
}
// rest of the function...
}
}
private void balanceBackendDisk(TStorageMedium medium, double avgUsedPercent) {
for (/* iteration variables */) {
if (olapTable == null) {
continue;
}
if (!olapTable.needSchedule(true)) {
// Add logging or checks to assess impact
continue;
}
if (isTabletUnhealthy(tabletMeta.getDbId(), olapTable, tabletId, tabletMeta, aliveBeIds)) {
continue;
}
// rest of the function...
}
}
private Map<Pair<Long, Long>, PartitionStat> getPartitionStats(TStorageMedium medium, boolean isLocalBalance) {
for (/* iteration variables */) {
if (!olapTbl.needSchedule(isLocalBalance)) {
// Consider implementing fallback checks here
continue;
}
for (Partition partition : globalStateMgr.getLocalMetastore().getAllPartitionsIncludeRecycleBin(olapTbl)) {
partitionChecked++;
if (partitionChecked % partitionBatchNum == 0) {
// additional logic...
}
}
// rest of the function...
}
}
Ensure you validate whether operations that are bypassed should still partially execute or trigger alerts/logs for evaluation.
b8619d0
to
6e049a1
Compare
Quality Gate passedIssues Measures |
Head branch was pushed to by a user without write access
fe/fe-core/src/main/java/com/starrocks/clone/DiskAndTabletLoadReBalancer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to add some ut
@mxdzs0612 You can just judge it in getPartitionTablets function. This is the base function to get the tablets |
[FE Incremental Coverage Report]❌ fail : 3 / 5 (60.00%) file detail
|
@@ -984,6 +988,11 @@ private Map<Pair<Long, Long>, Set<Long>> getPartitionTablets(long beId, TStorage | |||
} | |||
} | |||
|
|||
OlapTable olapTable = getOlapTableById(tabletMeta.getDbId(), tabletMeta.getTableId()); | |||
if (olapTable != null && !olapTable.needSchedule(true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also a balance between clusters to call getPartitionTablets(long beId, TStorageMedium medium, long pathHash). So you cannot just set isLocalBalance to true.
The another function getPartitionTablets(Long dbId, Long tableId, Long physicalPartitionId, Long indexId, List beIds, Pair<Long, List> bePaths) should be changed too.
Signed-off-by: Jiao Mingye <[email protected]>
Quality Gate passedIssues Measures |
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[BE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
return result; | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can change it to
if (!table.needSchedule(beIds == null)) {
return result;
}
Why I'm doing:
What I'm doing:
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: