Volume without a target node after 1 node down in a 3 node cluster #1724
I am seeing this same error. Just to provide some further details here, it looks like for the volumes that went down as a result of the node going down, the
Seems we had missed the first one, @veenadong, sorry about that. @dcaputo-harmoni, could you please share the volume attachments which reference this volume (if any) and also a support bundle?
@tiagolobocastro Unfortunately I had to kill and restore the cluster right after this happened, and didn't get a chance to export the data you are looking for before I did. If it happens again I'll provide these details, thanks. I was running Mayastor 2.7.0 and just upgraded to 2.7.1 when it rebuilt, and I know there are some stability improvements in there, so I'm wondering if that might help.
No problem. I guess keep an eye on it and, should it happen again, please let us know.
I think this bug, #1747 (or a variation of it), explains what happens here: CSI and the control-plane get out of sync, the volume ends up not staying published, and csi-node keeps trying to connect to a target which is not there.
@Abhinandan-Purkait @dsharma-dc any other thoughts here?
This looks like the sequence of events that happened (old node = glop-nm-126-mem1.glcpdev.cloud.hpe.com):
- 17:04:58 - Unpublish of the volume was triggered as a result of the node shutdown and failed (503 Service Unavailable):
  [pod/mayastor-agent-core-88bc8d8b9-k7dcs/agent-core] at control-plane/agents/src/bin/core/controller/resources/operations_helper.rs:168
- 17:05:22 - A retry of the failing unpublish happened and deleted the new target, as the spec referenced the new one.
I'm running into this issue and am generating the support bundle now. I'm not sure if the bundle has sensitive info in it, so I'd prefer not to post it publicly. Is there somewhere private I can send it?
To expand on this: I noticed that one of my three k3s worker nodes was down, so I restarted it. After that is when I started seeing the OpenEBS issue, although I can't say for sure it's related. I can also upload the "volume attachments", but I'm not 100% sure what those are. The PV that is failing to attach was created dynamically via a PVC, and the PVC is then mounted to a deployment; I can upload any of those manifests if that helps.
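(For reference, the "volume attachments" asked for above are the cluster-scoped Kubernetes VolumeAttachment objects, which record which node a PersistentVolume is attached to. A minimal way to list them and inspect the one referencing the failing PV might look like the following; the attachment name is a placeholder.)

```shell
# List all VolumeAttachment objects; the PV and NODE columns show which node
# each PersistentVolume is attached to
kubectl get volumeattachments

# Inspect the attachment that references the failing PV (name is a placeholder,
# taken from the output above)
kubectl describe volumeattachment <attachment-name>
```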
Looks like I was able to work around this by scaling the replicas down to 0 and then back to 1. The volume mounts successfully now. I'd still like to send the support bundle; let me know where I can send it.
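(A rough sketch of that workaround, assuming the affected PVC is consumed by a Deployment; the namespace and deployment name are placeholders.)

```shell
# Scale the application that mounts the affected PVC down to zero
kubectl -n <app-namespace> scale deployment <app-deployment> --replicas=0

# Wait for the pod to terminate, then scale back up so the volume gets
# re-staged and re-published on the node
kubectl -n <app-namespace> scale deployment <app-deployment> --replicas=1
```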
You can send it to [email protected]
I sent the tar file via an email from my cwiggs.com domain. Let me know if there is anything else that will help with the issue. Thanks!
Hi!
We haven't received the email.
That would help, thank you.
@cwiggs How did you do this? What commands did you run? @tiagolobocastro Is there any way to get the volumes back when/if this happens?
I keep getting a response from googlegroups.com that they weren't able to deliver the email since it has an attachment. I just sent it a 3rd time using Google Drive and so far it seems it went through.
@cwiggs, so what you did was to scale the OpenEBS deployment itself down to 0 and then up to 3?
No, just the deployment that is using OpenEBS for the PV and throwing this error.
I've requested access to the Google Drive, @cwiggs.
Hello! We also encounter this issue after a node reboot in our HA cluster with 3 nodes. Specifically, the node that restarts experiences a MountVolume failure with the following error:
The error logs from the csi-node show consistent failures:
This issue appears to affect StatefulSets, particularly the alertmanager-pgl-alertmanager in our setup. We've found that restarting the pod resolves the error, but we're looking for a more permanent solution. I'm willing to provide a support bundle too, but there's an issue with log collection because I have an OAuth2 proxy in front of Loki, and I haven't found a way to pass a token to kubectl mayastor dump system. If anyone knows how to handle this authentication issue, that would be great! :) In the meantime, I'll try to work around this and send the bundle to you.
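(One possible fallback, when the plugin's Loki-based collection can't authenticate through the proxy, is to pull the csi-node logs straight from the pods with kubectl; the namespace and container name below are assumptions based on a default OpenEBS install, so adjust them to match yours.)

```shell
# Find the csi-node pods (namespace is an assumption; adjust to your install)
kubectl -n openebs get pods -o wide | grep csi-node

# Dump the logs of one of them directly, bypassing Loki; the container name
# "csi-node" is also an assumption for a default install
kubectl -n openebs logs <csi-node-pod-name> -c csi-node --timestamps > csi-node.log
```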
Which version is this, @nneram? We've fixed a few issues in 2.7.2 that could be related to this, though it would require scaling the application back down to 0 and then back up.
I'm using Mayastor version 2.7.1 with the openebs chart version 4.1.1. My kubectl-mayastor plugin is at revision 399c96472dc3 (v2.7.2+0). I'll also try to email the bundle since I'm not comfortable sharing it here.
Both are having the same/similar issue.
2.7.0, taking 1 node down from a 3-node cluster:
Pods are not able to attach the volume:
Attached is the system dump (note: the log collection failed using the plugin, so I captured the logs using a different method).
mayastor.log.gz
mayastor-2024-08-20--18-05-02-UTC.tar.gz
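(As a quick way to confirm the "volume without a target node" state described above, the kubectl-mayastor plugin can list each volume together with its target node and status; the volume UUID below is a placeholder.)

```shell
# Show all Mayastor volumes with their target node, replica count and status
kubectl mayastor get volumes

# Inspect a single volume in more detail (UUID is a placeholder)
kubectl mayastor get volume <volume-uuid>
```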