Notifications not firing for Analysis Run fails when Analysis Run is part of an Experiment #4009

meeech · 2024-12-18T02:45:40Z

I am starting this ticket to capture the information as I investigate and try to resolve this issue.
If anyone has any pointers or thoughts, please add them.

When an analysis run fails or errors and that analysis run is part of an inline experiment step, we don't get the on-analysis-run-error or on-analysis-run-failed notification.

Analysis Run Error
✅ Background Analysis Run: event: AnalysisRunError object: rollout/basic-rollout
❌ Inline Step Analysis Run: event: AnalysisRunError object: experiment/basic-rollout-exp-steps-b66774df5-3-0

We get the RolloutAborted notification for both, because the event that fires belongs to the rollout/* object in both cases

Analysis Run Fail
✅ Background Analysis Run: event: AnalysisRunFailed object: rollout/basic-rollout
❌ Inline Step Analysis Run: event: AnalysisRunFailed object: experiment/basic-rollout-exp-steps-bd7bdfcc8-4-0

We get the RolloutAborted notification for both, because the event that fires belongs to the rollout/* object in both cases

So this has me thinking there are a few possible explanations:

Is it that we don't give the notification engine access to events on the experiment object?
Or are the events being fired off the wrong object: we use the experiment's EventRecorder when we should find the parent(?) rollout object and use its EventRecorder?
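
To make the second option concrete, here is a minimal toy model in Go. It is not code from argo-rollouts or notifications-engine: the Object type, its Owner and Subscriptions fields, and the subscribed and resolveTarget helpers are all hypothetical names made up for illustration. It also assumes that subscriptions live only on the Rollout, which I have not confirmed in the controller code.

```go
package main

import (
	"fmt"
	"strings"
)

// Object is a hypothetical stand-in for a Kubernetes object; it is not a type
// from argo-rollouts or notifications-engine.
type Object struct {
	Kind          string
	Name          string
	Owner         *Object  // hypothetical link to the parent object, nil if standalone
	Subscriptions []string // trigger names this object is subscribed to
}

// Ref renders the object as kind/name, like the event lines in this issue.
func (o *Object) Ref() string {
	return strings.ToLower(o.Kind) + "/" + o.Name
}

// subscribed reports whether obj itself subscribes to trigger.
func subscribed(obj *Object, trigger string) bool {
	for _, t := range obj.Subscriptions {
		if t == trigger {
			return true
		}
	}
	return false
}

// resolveTarget walks from the object the event was recorded on up through
// its owners and returns the first one subscribed to trigger, or nil if
// no object in the chain is subscribed.
func resolveTarget(obj *Object, trigger string) *Object {
	for cur := obj; cur != nil; cur = cur.Owner {
		if subscribed(cur, trigger) {
			return cur
		}
	}
	return nil
}

func main() {
	const trigger = "on-analysis-run-error"
	rollout := &Object{Kind: "Rollout", Name: "basic-rollout", Subscriptions: []string{trigger}}
	experiment := &Object{Kind: "Experiment", Name: "basic-rollout-exp-steps", Owner: rollout}

	// Looking only at the object the event was recorded on finds no subscriber.
	fmt.Println(experiment.Ref(), "subscribed directly:", subscribed(experiment, trigger))

	// Resolving through the owner chain reaches the rollout.
	if target := resolveTarget(experiment, trigger); target != nil {
		fmt.Println("notification would be sent for", target.Ref())
	}
}
```

In this model a standalone experiment has a nil Owner, so resolveTarget returns nil for it. That is the same open question as for standalone experiments in general: walking up to a parent only helps when there is a parent.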

I'll keep digging. I'm unsure what the ideal would be:
Would we want new triggers such as on-experiment-analysis-run-failed and on-experiment-analysis-run-error, or would it be better to reuse the existing triggers? I think when it's a step it makes sense to use the existing triggers and have the rollout object available for the templates, but what about standalone experiments?

Version

1.7.2 (but this has been a problem for as long as I've been using the experiment step, so since at least 1.5/1.6)

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.


meeech · 2024-12-19T03:24:48Z

More rough notes from a conversation with @zachaller:

Rollouts uses the notification engine in two ways:

As a library (handles on-event triggers)
As a controller (handles when triggers)

Path to explore

Create a new notification controller / deployment - this is similar to what Argo CD does.

If you implement a separate notification controller as its own deployment, you would have to recreate an event listener translator function like the one in the rollouts controller (https://github.com/argoproj/argo-rollouts/blob/master/utils%2Frecord%2Frecord.go#L373).
By default, the notification engine just adds a k8s watch on a kind and then runs the evaluation engine on it. Within Rollouts we instead have this on the k8s event system, which fires notifications from code rather than from the informer.
Argo CD does not work that way; only Rollouts does, which somewhat makes sense for ease of use.
The upstream notification engine doesn't support multiple kinds in one controller, so this might also require multiple config maps for configuration.
It may be an easier path to make a new controller (example of making a new one with the notification engine: https://github.com/argoproj/notifications-engine/blob/master/examples%2Fcertmanager%2Fcontroller%2Fmain.go).
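
For reference, a rough sketch of what such an event-reason-to-trigger translator might look like if it had to be recreated in a separate controller. This is not copied from record.go. The naming convention (CamelCase reason to "on-" plus kebab-case) is an assumption inferred from pairs in this thread, such as AnalysisRunError and on-analysis-run-error. reasonToTrigger is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// reasonToTrigger converts a CamelCase event reason into an "on-..." trigger
// name by inserting a hyphen before every uppercase letter and lowercasing it.
// Assumption: reasons contain no runs of consecutive capitals (e.g. "HTTPError"
// would become "on-h-t-t-p-error").
func reasonToTrigger(reason string) string {
	var b strings.Builder
	b.WriteString("on")
	for _, r := range reason {
		if unicode.IsUpper(r) {
			b.WriteByte('-')
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

func main() {
	for _, reason := range []string{"AnalysisRunError", "RolloutAborted"} {
		fmt.Println(reason, "->", reasonToTrigger(reason))
	}
}
```

A separate controller would still need to decide which object's subscriptions to consult once it has the trigger name. That is the part that seems to differ between rollout and experiment events.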

meeech added the bug Something isn't working label Dec 18, 2024
