
Additional taints conflict with default taints and cause constant restarts #348

Open
jon-rei opened this issue Jan 17, 2025 · 3 comments · May be fixed by #354
Labels
bug Something isn't working

Comments

jon-rei commented Jan 17, 2025

/kind bug

What happened?

In our cluster we want to run the s3-csi-driver only on certain nodes using a nodeSelector. We can't use tolerateAllTaints because we are also running Cilium, which requires a startup taint so that pods are only scheduled after Cilium is initialised. So we are currently adding a toleration for that node taint via the tolerations Helm value.

What we discovered is that this clashes with the tolerations hardcoded into the template when tolerateAllTaints is disabled: after 300s, all the pods in the DaemonSet are restarted. This happens continuously.

What you expected to happen?

It would be great if the hardcoded tolerations could be disabled or overridden. A simple solution would be to move them from the template into the default of the tolerations value.
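For illustration, a hedged sketch of that approach (field names are assumed from the chart's current layout; only the NoExecute/300s default toleration is confirmed in this thread):

```yaml
# Hypothetical values.yaml sketch: the previously hardcoded default
# toleration becomes an overridable default in the chart values.
node:
  tolerateAllTaints: false
  tolerations:
    # Current hardcoded default (assumed, moved out of the template);
    # users could now remove or override it alongside their own entries.
    - operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
```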

How to reproduce it (as minimally and precisely as possible)?

Use the s3-csi-driver with tolerateAllTaints: false and any other additional tolerations.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): EKS v1.31
  • Driver version: v1.11.0
@jon-rei jon-rei changed the title Additional taints conflict with default taints and leads to constant restarts Additional taints conflict with default taints and cause constant restarts Jan 17, 2025
unexge (Contributor) commented Jan 17, 2025

Hey @jon-rei, thanks for reporting the issue. To ensure I understand the problem correctly:

Let's say you're tainting your nodes with:

$ kubectl taint nodes node1 key1=value1:NoExecute

That means any Pod that's not tolerating key1=value1 will be evicted from node1, and in order to prevent that you add tolerations to the CSI Driver Pods using the node.tolerations Helm value:

$ helm upgrade --install aws-mountpoint-s3-csi-driver \
    ...
    # or anything equivalent
    --set "node.tolerations[0].key=key1" \
    --set "node.tolerations[0].operator=Exists" \
    --set "node.tolerations[0].effect=NoExecute"

but the CSI Driver Pods still get evicted from node1 after 300 seconds due to the default toleration:

- operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
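Presumably, the rendered DaemonSet then ends up carrying both the user-supplied toleration and the hardcoded default, along these lines (sketch assembled from the two snippets above):

```yaml
tolerations:
  # User-supplied entry from the node.tolerations Helm value
  - key: key1
    operator: Exists
    effect: NoExecute
  # Hardcoded default from the template, with the 300s eviction window
  - operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```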

Is my understanding of the problem correct?

jon-rei (Author) commented Jan 17, 2025

Hi @unexge,
yes, that's correct. After manually removing the hardcoded taint, the problem disappeared. But since we are using ArgoCD, a real solution would be great here. If you are open to PRs, I could create one myself.

unexge (Contributor) commented Jan 17, 2025

Hey @jon-rei, I think providing a way to override default tolerations sounds reasonable. We recently made some changes in our CI to support creating PRs from forks, so hopefully we should be able to accept contributions now.

@unexge unexge added the bug Something isn't working label Jan 21, 2025
@jon-rei jon-rei linked a pull request Jan 21, 2025 that will close this issue