Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for intra.broker.goals in anomaly detection / self healing #6

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

ecojan
Copy link

@ecojan ecojan commented Dec 15, 2021

This PR resolves linkedin#1264 by enabling the self-healing for intra-broker goals. This however does not account for race conditions between Cluster level Goals and will only trigger after said goals are finished.
The scope of this PR is to allow self-healing on IntraBroker goals for now (as future work might make this obsolete as intra-broker goals and regular goals might be made to work together).

New configuration:

anomaly.detection.intra.broker.goals= List of intra-broker goals for the Anomaly detector to look for
self.healing.intra.broker.goals= List of Intra broker Goals for self-healing
self.healing.intra.broker.goal.violation.enabled= If the self-healing should be enabled for intra-broker-goals

New Anomaly Type: INTRA_BROKER_GOAL_VIOLATION

A new type of Anomaly that needs to be treated differently from the regular GOAL_VIOLATIONS.
When configuring for the rebalance parameters for this violation, we are skipping hard goals check and ignoring proposalsCache.
We are skipping hard goals check because intra-broker goals may not be hard goals (at the current given time).
We are ignoring proposalsCache because the proposals model takes into account all given default goals. If we add the intraBrokersGoals this will not work properly as currently, disk-granularity goals do not work with broker-granularity goals.

New Anomaly detectors: IntraBrokerGoalViolationDetector

This anomaly detector is only looking for the anomaly.detection.intra.broker.goal and unlike the GoalViolationDetector, it's using a cluster model that is generating replica placement on disks as well.
This anomaly detector mainly functions for INTRA_BROKER_GOAL_VIOLATIONS.

What is missing from this feature

  • missing representation on INTRA_BROKER_GOAL_VIOLATIONS in CruiseControl UI when an anomaly is detected (as this is a structure in cruise-control-ui project)

@amuraru
Copy link

amuraru commented Jan 5, 2022

@ecojan one particular aspect which is not yet clear to me is the order of the INTRA and INTER broker goal violations.
is the new intra-broker-goal-violation detector/healer run before the intra one?

@ecojan
Copy link
Author

ecojan commented Jan 10, 2022

@ecojan one particular aspect which is not yet clear to me is the order of the INTRA and INTER broker goal violations. is the new intra-broker-goal-violation detector/healer run before the intra one?

My expectancy is GOAL_VIOLATIONS followed by INTRA_BROKER_GOAL_VIOLATIONS.
As stated on

scheduleDetectorAtFixedRate(INTRA_BROKER_GOAL_VIOLATION, _intraBrokerGoalViolationDetector);

the detectors are scheduled on a fixed rate and in order (have not checked the inner workings for maintaining order on ScheduledExecutorService from the java.util.concurrent package).
For INTRA broker goal violation detector, we reuse the same anomalyDetection interval from the INTER broker goal violation detector.

@amuraru
Copy link

amuraru commented Jan 11, 2022

thanks for the pointer - yeah - they all seem to be scheduled concurrently with the same (default) interval.
This means that whichever detector finishes first will be the one scheduling potential fix tasks.
My idea was if possible to consider running INTRA_BROKER_GOAL_VIOLATIONS more frequently and thus giving it a chance to run before the GOAL_VIOLATIONS one - the reason being that it can level the usage within a broker before other paritions are migrated to it as part of the global goal_violation detector.
but this is not mandatory at this stage.

@amuraru amuraru force-pushed the master branch 3 times, most recently from a2437a7 to be9586f Compare March 14, 2022 09:30
@amuraru amuraru force-pushed the master branch 2 times, most recently from ebd4582 to 2c8caf9 Compare March 30, 2022 10:39
@amuraru amuraru force-pushed the master branch 2 times, most recently from 45eb48f to c399102 Compare May 13, 2022 21:47
amuraru added 5 commits July 6, 2022 16:40
See: linkedin#1242
This is an internal patch to use SystemCpuLoad to monitor broker pod
cpu usage knowing that we always run the broker in a container.

As `SystemCpuLoad` reports un-normalized cpu load across all cores
we do this normalization to match CC expected value in [0, 1] interval.
In our environment we've seen a case where RackAwareDistributionGoal
and MinTopicLeaderPerBrokerGoal are conflicting.
As both are hard goals the rebalance fails

Lowering MinTopicLeaderPerBrokerGoal as a soft goal
@ecojan
Copy link
Author

ecojan commented Jul 13, 2022

did a rebase, waiting for build status

@ecojan ecojan force-pushed the intra-broker-self-healing branch from a1e2f2c to 38f7b74 Compare July 13, 2022 11:55
Copy link

@dobrerazvan dobrerazvan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@@ -553,7 +590,12 @@ public void run() {
// We compute the proposal even if there is not enough modeled partitions.
ModelCompletenessRequirements requirements = _loadMonitor.meetCompletenessRequirements(_defaultModelCompletenessRequirements)
? _defaultModelCompletenessRequirements : _requirementsWithAvailableValidWindows;
ClusterModel clusterModel = _loadMonitor.clusterModel(_time.milliseconds(), requirements, _allowCapacityEstimation, operationProgress);

ClusterModel clusterModel = null;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why the cluster model if first set to null and afterwards with an actual value?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to use intra.broker.goals in anomaly detection / self healing
4 participants