
Additional taints conflict with default taints and cause constant restarts #348

Open
jon-rei opened this issue Jan 17, 2025 · 3 comments · May be fixed by #354
Labels
bug Something isn't working

Comments

jon-rei commented Jan 17, 2025

/kind bug

What happened?

In our cluster we want to run the s3-csi-driver only on certain nodes using a nodeSelector. We can't use tolerateAllTaints because we are also running Cilium, which requires a startup taint so that pods are only scheduled after Cilium is initialised. So we are currently adding a toleration for that node taint via the tolerations Helm value.

What we discovered is that this clashes with the tolerations hardcoded into the template when tolerateAllTaints is disabled: after 300s, all the pods in the DaemonSet are restarted. This happens continuously.

What you expected to happen?

It would be great if the hardcoded tolerations could be disabled or overridden. A simple solution would be to move them from the template into the default of the tolerations value.
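For illustration, a hedged sketch of that approach (field names are assumed from the chart's current layout; only the NoExecute/300s default toleration is confirmed in this thread):

```yaml
# Hypothetical values.yaml sketch: the previously hardcoded default
# toleration becomes an overridable default in the chart values.
node:
  tolerateAllTaints: false
  tolerations:
    # Current hardcoded default (assumed, moved out of the template);
    # users could now remove or override it alongside their own entries.
    - operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
```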

How to reproduce it (as minimally and precisely as possible)?

Use the s3-csi-driver with tolerateAllTaints: false and any other additional tolerations.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): EKS v1.31
  • Driver version: v1.11.0
@jon-rei jon-rei changed the title Additional taints conflict with default taints and leads to constant restarts Additional taints conflict with default taints and cause constant restarts Jan 17, 2025
unexge (Contributor) commented Jan 17, 2025

Hey @jon-rei, thanks for reporting the issue. To ensure I understand the problem correctly:

Let's say you're tainting your nodes with:

$ kubectl taint nodes node1 key1=value1:NoExecute

That means any Pod that's not tolerating key1=value1 will be evicted from node1, and in order to prevent that you add tolerations to the CSI Driver Pods using the node.tolerations Helm value:

$ helm upgrade --install aws-mountpoint-s3-csi-driver \
    ...
    # or anything equivalent
    --set "node.tolerations[0].key=key1" \
    --set "node.tolerations[0].operator=Exists" \
    --set "node.tolerations[0].effect=NoExecute"

but the CSI Driver Pods still get evicted from node1 after 300 seconds due to the default toleration:

- operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
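Presumably, the rendered DaemonSet then ends up carrying both the user-supplied toleration and the hardcoded default, along these lines (sketch assembled from the two snippets above):

```yaml
tolerations:
  # User-supplied entry from the node.tolerations Helm value
  - key: key1
    operator: Exists
    effect: NoExecute
  # Hardcoded default from the template, with the 300s eviction window
  - operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```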

Is my understanding of the problem correct?

jon-rei (Author) commented Jan 17, 2025

Hi @unexge,
yes, that's correct. After manually removing the hardcoded taint, the problem disappeared. But since we are using ArgoCD, a real solution would be great here. If you are open to PRs, I could create one myself.

unexge (Contributor) commented Jan 17, 2025

Hey @jon-rei, I think providing a way to override default tolerations sounds reasonable. We recently made some changes in our CI to support creating PRs from forks, so hopefully we should be able to accept contributions now.

@unexge unexge added the bug Something isn't working label Jan 21, 2025
@jon-rei jon-rei linked a pull request Jan 21, 2025 that will close this issue