Getting progress and final status for a self-healing fix related to a detected anomaly #2215

ppatierno · 2024-11-05T17:12:36Z

Hi all,
I was looking for a way to get the current status of a self-healing fix in progress for a detected anomaly.
AFAIU from the code (mostly looking at the Executor class and the usage of the _userTaskManager), when a task runs because it was triggered by a detected anomaly, it is not a user task (of course!) and has no corresponding UserTaskInfo instance, so it won't show up in the /user_tasks endpoint.
So I looked at the /state?json=true&substates=anomaly_detector endpoint (which provides info via the AnomalyDetectorState class), but in this case too, for each anomaly (in the cache) the last status is FIX_STARTED and there is no FIX_DONE. So when I have the anomalyId, the only way I see is to search the reported JSON for the anomaly with that id and FIX_STARTED status, and cross-check it against the ongoingSelfHealingAnomaly field (whether it reports the same anomaly id). But I don't see this as a great workaround to tell whether a fix for an anomaly is running (ongoingSelfHealingAnomaly is filled with the anomaly id) or has ended (the anomaly was FIX_STARTED but now ongoingSelfHealingAnomaly is empty or doesn't exist anymore).
Is there a better way, using the Cruise Control REST API, that I'm missing?
Also, is it possible to get the optimization proposal related to the fix for the detected anomaly?
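
For illustration, the cross-check described above could be sketched roughly as follows. This is only a sketch under assumptions: it takes values that a client has already extracted from the anomaly_detector JSON (a map from anomaly id to last reported status, plus the ongoingSelfHealingAnomaly value), and the class, enum, and method names are hypothetical, not part of Cruise Control.

```java
import java.util.Map;

// Hypothetical helper, not part of Cruise Control.
final class AnomalyDetectorCheck {
    private AnomalyDetectorCheck() {}

    enum FixState { NOT_STARTED, RUNNING, ENDED }

    /**
     * @param statusByAnomalyId         last reported status per anomaly id (assumed to be
     *                                  extracted from the anomaly_detector substate by the caller)
     * @param ongoingSelfHealingAnomaly value of the ongoingSelfHealingAnomaly field; may be null or empty
     * @param anomalyId                 the anomaly being tracked
     */
    static FixState fixState(Map<String, String> statusByAnomalyId,
                             String ongoingSelfHealingAnomaly,
                             String anomalyId) {
        // The anomaly must have been reported with FIX_STARTED at some point.
        if (!"FIX_STARTED".equals(statusByAnomalyId.get(anomalyId))) {
            return FixState.NOT_STARTED;
        }
        // Still running only while the ongoing anomaly id is the one we track.
        return anomalyId.equals(ongoingSelfHealingAnomaly) ? FixState.RUNNING : FixState.ENDED;
    }
}
```

Note that ENDED here only means the fix is no longer reported as ongoing; it says nothing about whether the fix succeeded.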

Also, I noticed that the AnomalyDetectorState class holds a reference to the notifier implementation, so I was thinking that the AnomalyNotifier interface could have an additional method that is called when the fix is done (through markSelfHealingFinished). What do you think about this as well?

Thanks!


ppatierno · 2024-11-06T10:47:49Z

The other way I found is querying the /state?json=true&substates=executor endpoint and checking whether the triggeredSelfHealingTaskId field matches the anomaly id I am looking for. In this case, the only way to know that the anomaly was fixed is getting NO_TASK_IN_PROGRESS, an empty triggeredSelfHealingTaskId (maybe the running task is related to a user request), or a non-empty triggeredSelfHealingTaskId with a value different from the anomaly id I am looking for (which would mean that the anomaly was fixed and the executor is now running a task to fix a new one).
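
The three conditions above can be sketched as a single predicate. This is a minimal sketch, assuming the caller has already read the executor state string and the triggeredSelfHealingTaskId value from the executor substate JSON, and that the idle state is spelled NO_TASK_IN_PROGRESS; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Cruise Control.
final class SelfHealingFixCheck {
    private SelfHealingFixCheck() {}

    /**
     * @param executorState              executor state string as reported by the endpoint
     * @param triggeredSelfHealingTaskId value of the triggeredSelfHealingTaskId field; may be null or empty
     * @param anomalyId                  the anomaly being tracked
     * @return true if the executor no longer appears to be working on that anomaly
     */
    static boolean isFixFinished(String executorState,
                                 String triggeredSelfHealingTaskId,
                                 String anomalyId) {
        // 1. Nothing is running at all.
        if ("NO_TASK_IN_PROGRESS".equals(executorState)) {
            return true;
        }
        // 2. Something is running, but it was not triggered by self-healing
        //    (for example a user-requested rebalance).
        if (triggeredSelfHealingTaskId == null || triggeredSelfHealingTaskId.isEmpty()) {
            return true;
        }
        // 3. Self-healing is running, but for a different anomaly.
        return !triggeredSelfHealingTaskId.equals(anomalyId);
    }
}
```

A caller would only start evaluating this after the fix has been observed as started; otherwise an idle executor seen before the fix begins would be misread as "finished". The predicate also cannot distinguish a successful fix from a failed or aborted one.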

Despite that, I still think it would be useful for AnomalyDetectorState to call the notifier interface when an anomaly fix ends. Is this something you would be willing to accept as a contribution to the project?

ppatierno · 2024-11-06T14:46:02Z

@CCisGG I was wondering what you think about the above. Thanks!

CCisGG · 2024-11-06T18:54:34Z

I'm not entirely sure what you are trying to achieve here. My understanding is that you probably want to know the state of a self-healing-triggered rebalance during and after the rebalance. Here is what I think:

During rebalance, you can definitely see the execution progress with the /state?json=true&substates=executor endpoint. I'm not sure whether it shows the anomaly id today. If not, I'm OK with adding such information.
After rebalance, I think you are right that there is no good way to check it, as it is not a user task, but I think you can still see the rebalance info in both cruise-control.log and cruise-control-operation.log. Do you think that's good enough for your case?

cc: @mhratson @allenxwang

ppatierno · 2024-11-07T09:16:57Z

Hey @CCisGG first of all thank you very much for your prompt answer!

> During rebalance, you can definitely see the execution progress with the /state?json=true&substates=executor endpoint. I'm not sure whether it shows the anomaly id today. If not, I'm OK with adding such information.

Currently the executor status JSON already has a triggeredSelfHealingTaskId field where you have the anomaly id so I can use it to get the status.

> After rebalance, I think you are right that there is no good way to check it, as it is not a user task, but I think you can still see the rebalance info in both cruise-control.log and cruise-control-operation.log. Do you think that's good enough for your case?

In general, I don't think that looking at logs is a great UX. I would expect people using Cruise Control to use the REST API to show such information, as the Cruise Control UI project does, for example. Users could have their own tools to monitor this kind of info via the REST endpoints.

That said, I am a maintainer of the Strimzi project (an operator to run Apache Kafka on Kubernetes), which has a full integration with Cruise Control, and I am working on bringing the self-healing feature onboard. I am using a "custom" notifier to notify the operator that an anomaly was detected and the fix started, but then it is not that simple for the operator to know that the fix is done (whereas it is simple when the user starts a rebalance, because we can use the user_tasks endpoint).
The only viable way I found is querying the executor endpoint (as you also suggested above), but it only provides the progress, so to understand that a fix is done I have to check that the state is NO_TASK_IN_PROGRESS, or that the triggeredSelfHealingTaskId is empty or different (another anomaly is being fixed, so the previous one is done). Also, the anomaly_detector endpoint doesn't provide a state like FIX_DONE, just FIX_STARTED, as mentioned before.

I was also proposing that the notifier be called when an anomaly is fixed. This requires updating the corresponding Anomaly interface (which is marked as "evolving" anyway). It doesn't look like a big effort, because AnomalyDetectorState already holds a notifier reference that could be called when the anomaly is fixed (via the markSelfHealingFinished method).

ppatierno · 2024-11-29T15:16:40Z

@CCisGG I guess this issue didn't get enough interest from the Cruise Control maintainers?

CCisGG · 2024-12-03T04:03:14Z

Hi @ppatierno, sorry for the delay. I started a discussion about this issue within the team and we haven't reached a conclusion yet. I personally think your proposal has solid reasons, but I'm still hesitant since it may break existing users who depend on Cruise Control.

CCisGG · 2024-12-03T04:06:26Z

And practically, I think the review for this change may also take a long time. To unblock your case, I think a better idea might be relying on the logs for now. Please also feel free to add more useful logs; I think that would be easier to review and accept.

ppatierno · 2024-12-03T08:21:51Z

> but I'm still hesitant since it may break existing users who depend on Cruise Control.

If you are referring to the addition to the AnomalyNotifier interface, I'm not sure what adding a new method would break. Anyway, it's marked as "evolving" in the repo, which states that users should expect breaking changes.

> And practically, I think the review for this change may also take a long time.

Of course it will take the time it deserves, like any other suggestion. I am running the Strimzi project and I can understand that. Not all proposals and PRs are merged straight away.

> To unblock your case, I think a better idea might be relying on the logs for now. Please also feel free to add more useful logs; I think that would be easier to review and accept.

It's not a solution. Within Strimzi, Cruise Control is used in a more automated way by the Strimzi operator and looking at the log means a human doing that. I think that in a cloud-native environment, Cruise Control should aim to have more automation facilitating interaction with operators and not humans. This was my goal.

CCisGG · 2024-12-03T17:43:57Z

> If you are referring to the addition to the AnomalyNotifier interface, I'm not sure what adding a new method would break. Anyway, it's marked as "evolving" in the repo, which states that users should expect breaking changes.

It will break the build for people who implement the interface but do not implement the new method.
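
For context, Java (8 and later) lets an interface add a method with a default body, so existing implementers keep compiling without changes. Whether that would be acceptable for Cruise Control is not settled in this thread. The sketch below uses a made-up interface and made-up method names purely for illustration; it is not the real AnomalyNotifier API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a notifier interface; not the real AnomalyNotifier.
interface NotifierSketch {
    // Stands in for a method that implementers already provide today.
    void onAnomalyDetected(String anomalyId);

    // Hypothetical new callback. The default no-op body means existing
    // implementers compile unchanged and simply ignore the event.
    default void onSelfHealingFinished(String anomalyId) {
        // no-op by default
    }
}

// An "existing" implementation that knows nothing about the new method.
final class LegacyNotifier implements NotifierSketch {
    @Override
    public void onAnomalyDetected(String anomalyId) {
        // existing behaviour would go here
    }
}

// A newer implementation that opts in to the extra callback.
final class FinishAwareNotifier implements NotifierSketch {
    final List<String> finished = new ArrayList<>();

    @Override
    public void onAnomalyDetected(String anomalyId) {
        // existing behaviour would go here
    }

    @Override
    public void onSelfHealingFinished(String anomalyId) {
        finished.add(anomalyId);
    }
}
```

With this shape, an operator-side notifier could override the new callback to learn that a fix ended, while older notifiers would keep building and running as before.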

> It's not a solution. Within Strimzi, Cruise Control is used in a more automated way by the Strimzi operator and looking at the log means a human doing that. I think that in a cloud-native environment, Cruise Control should aim to have more automation facilitating interaction with operators and not humans. This was my goal.

It makes sense.
