
rbd remap on network failure #4712

Closed

clwluvw opened this issue Jul 15, 2024 · 8 comments
Labels: component/rbd (Issues related to RBD), dependency/k8s (depends on Kubernetes features), wontfix (This will not be worked on)

Comments

clwluvw (Member) commented Jul 15, 2024

Describe the feature you'd like to have

After a node experiences a network outage, all RBD images mapped on the node lose their watcher and their filesystems are remounted read-only.

blk_update_request: I/O error, dev rbd3, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
Buffer I/O error on dev rbd3, logical block 0, lost sync page write
EXT4-fs (rbd3): I/O error while writing superblock
EXT4-fs (rbd3): Remounting filesystem read-only

Perhaps cephcsi could watch the PVCs and make sure they are remapped if the watcher is gone or the filesystem is mounted read-only when it should not be.
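As a rough illustration of the kind of check such a watcher component could perform (this is not something cephcsi does today), the sketch below shells out to `rbd status --format json` and inspects the watcher list; the pool and image names are placeholders:

```go
// watchercheck.go: a minimal sketch of the external check suggested above,
// detecting RBD images that have lost their watcher. Not part of cephcsi;
// the pool/image names used in main() are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// rbdStatus mirrors the relevant part of `rbd status --format json` output.
type rbdStatus struct {
	Watchers []struct {
		Address string `json:"address"`
	} `json:"watchers"`
}

// hasWatcher returns true when at least one watcher is registered on the image.
func hasWatcher(pool, image string) (bool, error) {
	out, err := exec.Command("rbd", "status", "--format", "json",
		fmt.Sprintf("%s/%s", pool, image)).Output()
	if err != nil {
		return false, err
	}
	var st rbdStatus
	if err := json.Unmarshal(out, &st); err != nil {
		return false, err
	}
	return len(st.Watchers) > 0, nil
}

func main() {
	ok, err := hasWatcher("replicapool", "csi-vol-example") // placeholder names
	if err != nil {
		fmt.Println("rbd status failed:", err)
		return
	}
	if !ok {
		fmt.Println("image has no watcher; it likely needs to be remapped")
	}
}
```

An external controller could run a check like this periodically against the images backing mapped PVCs and trigger a remap or pod restart when no watcher is present.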

What is the value to the end user? (why is it a priority?)

Recover volumes after a network failure that does not otherwise affect the pods or anything else on the node.

How will we know we have a good solution? (acceptance criteria)

After recovering from a network failure, mounted RBD images should be remapped and should not remain read-only when they are not meant to be.

Madhu-1 (Collaborator) commented Jul 16, 2024

@clwluvw This is not possible with cephcsi, as cephcsi is not Kubernetes-specific (even though we have some Kubernetes-specific logic, which we are planning to get rid of soon). This needs to be done by some external operator.

nixpanic added the component/rbd label on Jul 16, 2024
nixpanic (Member) commented
In order to re-map an RBD image, all users (applications) of the filesystem need to be restarted as well. It is cleaner for an application to report a failure and cause a restart of the container once such a problem happens.

Ceph-CSI has a health-checker that is currently only used with CephFS (see #4200). We had plans to extend that to RBD as well, but this has not been done. It would be possible to report that the volume is unhealthy to kubelet (in the NodeGetVolumeStats reply), which could then act on it (currently it only logs, so there is not much practical recovery yet).
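For illustration only (the RBD health-check mentioned above has not been implemented), a CSI node plugin can signal an unhealthy volume through the VolumeCondition field of the NodeGetVolumeStats reply, which kubelet can then surface. A minimal sketch using the CSI Go bindings, where isMountedReadOnly is a hypothetical helper:

```go
// Sketch only: how a CSI node plugin could mark an RBD volume as unhealthy in
// the NodeGetVolumeStats reply. isMountedReadOnly is a hypothetical helper.
package rbdhealth

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// isMountedReadOnly is a placeholder for a real check, e.g. parsing
// /proc/self/mountinfo for the "ro" option on the volume path.
func isMountedReadOnly(path string) bool { return false }

func NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error) {
	condition := &csi.VolumeCondition{Abnormal: false, Message: "volume is healthy"}
	if isMountedReadOnly(req.GetVolumePath()) {
		condition = &csi.VolumeCondition{
			Abnormal: true,
			Message:  "filesystem is mounted read-only, RBD image may need to be remapped",
		}
	}
	return &csi.NodeGetVolumeStatsResponse{
		// Usage entries (capacity/inodes) are omitted in this sketch.
		VolumeCondition: condition,
	}, nil
}
```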

nixpanic added the dependency/k8s label on Jul 16, 2024
clwluvw (Member, Author) commented Jul 16, 2024

It would be possible to report that the volume is unhealthy to kubelet (in the NodeGetVolumeStats reply), which then can take actions on it (currently it only logs, so not much practical recovery yet).

I guess this would be the best option, but Kubernetes would need to be extended to restart the pod in case of a volume failure.

Madhu-1 (Collaborator) commented Jul 16, 2024

@clwluvw Yes, that is correct. For now, I think you can add a check in a hook to ensure the PVC is writable and, if not, restart the pod. This might work for RWO volumes but not for RWX, as all the pods would need to be scaled down to 0 and scaled back up.

clwluvw (Member, Author) commented Jul 16, 2024

@Madhu-1 What do you mean by the hook? Do you mean the readinessProbe?

Madhu-1 (Collaborator) commented Jul 16, 2024

I mean using a liveness probe that checks for a file in the PVC and restarts the container.
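As a sketch of that suggestion (not an existing cephcsi or Kubernetes component), a tiny probe program can try to write a marker file inside the PVC mount and exit non-zero on failure; the mount path below is a placeholder:

```go
// pvcprobe: a minimal sketch of the write check a livenessProbe could exec.
// It exits non-zero when the PVC mount is not writable (e.g. remounted
// read-only), so kubelet restarts the container. The mount path is a placeholder.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	mountPath := "/data" // placeholder: the PVC mountPath inside the container
	probeFile := filepath.Join(mountPath, ".liveness-probe")

	// Write and remove a small marker file; any error means the volume is
	// not writable and the probe should fail.
	if err := os.WriteFile(probeFile, []byte("ok"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "PVC not writable:", err)
		os.Exit(1)
	}
	_ = os.Remove(probeFile)
}
```

A livenessProbe would then exec this binary (or an equivalent write test) against the PVC mount path and let kubelet restart the container when the write fails.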

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions bot added the wontfix label on Aug 15, 2024
github-actions bot commented

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions bot closed this as not planned (stale) on Aug 22, 2024