Handle context canceled errors on shutdown #2843

Open
romulets opened this issue Dec 17, 2024 · 0 comments · May be fixed by #2936
Labels
Feature:Cloud-Security Cloud Security related features Team:Cloud Security Cloud Security team related technical debt

Comments

@romulets
Member

Cloudbeat can receive a context cancellation at any moment. Right now, once a context is canceled, we log errors from multiple different places in cloudbeat.

That is especially troublesome in agentless, where pods tend to be restarted or deleted more frequently than in a standard agent-based solution. On top of that, in agentless we are paged based on the number of errors, and a cloudbeat shutdown during a cycle might alert the engineer on duty (urgency low, example).

Image

The error logging is spread throughout the code, and we can't just unify all errors and raise them up, because some of them are "optional" errors (we log them but don't stop the execution). Example.

Ideally we find a strategy that produces no alerts in such a scenario, because a context canceled on shutdown is something an on-call engineer cannot act upon, and is therefore a false positive.

There are two directions we could see ourselves going in:

  1. From cloudbeat, we could write a wrapper or handler around logp that receives the error and, if it is a context canceled error, lowers the level to warn (or whatever else we decide). Or we could handle it case by case, which would be very repetitive.

  2. Don't alert in case of pod shutdown or restart. That might be tricky to configure and might hide a legitimate issue. But the fact is that once a pod is shut down there is nothing an on-call engineer can do. There is no customer impact. There is nothing to fix: the pod is gone. So should we alert on non-actionable problems?

@romulets romulets added Feature:Cloud-Security Cloud Security related features Team:Cloud Security Cloud Security team related labels Dec 17, 2024
@romulets romulets changed the title Handle context canceled errors Handle context canceled errors on shutdown Dec 17, 2024
@oren-zohar oren-zohar added this to the 8.18 milestone Jan 20, 2025
3 participants