
rbd remap on network failure #4712

Closed

clwluvw opened this issue Jul 15, 2024 · 8 comments
Labels: component/rbd (Issues related to RBD), dependency/k8s (depends on Kubernetes features), wontfix (This will not be worked on)

Comments

clwluvw (Member) commented Jul 15, 2024

Describe the feature you'd like to have

After a node experiences a network outage, all RBD images mapped on the node lose their watcher and their filesystems are remounted read-only.

blk_update_request: I/O error, dev rbd3, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
Buffer I/O error on dev rbd3, logical block 0, lost sync page write
EXT4-fs (rbd3): I/O error while writing superblock
EXT4-fs (rbd3): Remounting filesystem read-only

Perhaps cephcsi could watch the PVCs and make sure they are remapped if the watcher is gone or the filesystem is mounted read-only when it should not be.
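As a rough illustration of the kind of check such a watcher component could perform (this is not something cephcsi does today), the sketch below shells out to `rbd status --format json` and inspects the watcher list; the pool and image names are placeholders:

```go
// watchercheck.go: a minimal sketch of the external check suggested above,
// detecting RBD images that have lost their watcher. Not part of cephcsi;
// the pool/image names used in main() are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// rbdStatus mirrors the relevant part of `rbd status --format json` output.
type rbdStatus struct {
	Watchers []struct {
		Address string `json:"address"`
	} `json:"watchers"`
}

// hasWatcher returns true when at least one watcher is registered on the image.
func hasWatcher(pool, image string) (bool, error) {
	out, err := exec.Command("rbd", "status", "--format", "json",
		fmt.Sprintf("%s/%s", pool, image)).Output()
	if err != nil {
		return false, err
	}
	var st rbdStatus
	if err := json.Unmarshal(out, &st); err != nil {
		return false, err
	}
	return len(st.Watchers) > 0, nil
}

func main() {
	ok, err := hasWatcher("replicapool", "csi-vol-example") // placeholder names
	if err != nil {
		fmt.Println("rbd status failed:", err)
		return
	}
	if !ok {
		fmt.Println("image has no watcher; it likely needs to be remapped")
	}
}
```

An external controller could run a check like this periodically against the images backing mapped PVCs and trigger a remap or pod restart when no watcher is present.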

What is the value to the end user? (why is it a priority?)

Recover volumes after a network failure that does not otherwise affect the pods or anything else on the node.

How will we know we have a good solution? (acceptance criteria)

After recovering from a network failure, mounted RBD images should be remapped and should not remain read-only when they are not meant to be.

Madhu-1 (Collaborator) commented Jul 16, 2024

@clwluvw This is not possible with cephcsi, as cephcsi is not Kubernetes-specific (even though we have some Kubernetes-specific logic, which we are planning to get rid of soon). This needs to be done by some external operator.

nixpanic added the component/rbd label on Jul 16, 2024
nixpanic (Member) commented
In order to re-map an RBD image, all users (applications) of the filesystem need to be restarted as well. It is cleaner for an application to report a failure and cause a restart of the container once such a problem happens.

Ceph-CSI has a health-checker that is currently only used with CephFS (see #4200). We had plans to extend that to RBD as well, but this has not been done. It would be possible to report that the volume is unhealthy to kubelet (in the NodeGetVolumeStats reply), which could then act on it (currently it only logs, so there is not much practical recovery yet).
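For illustration only (the RBD health-check mentioned above has not been implemented), a CSI node plugin can signal an unhealthy volume through the VolumeCondition field of the NodeGetVolumeStats reply, which kubelet can then surface. A minimal sketch using the CSI Go bindings, where isMountedReadOnly is a hypothetical helper:

```go
// Sketch only: how a CSI node plugin could mark an RBD volume as unhealthy in
// the NodeGetVolumeStats reply. isMountedReadOnly is a hypothetical helper.
package rbdhealth

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// isMountedReadOnly is a placeholder for a real check, e.g. parsing
// /proc/self/mountinfo for the "ro" option on the volume path.
func isMountedReadOnly(path string) bool { return false }

func NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error) {
	condition := &csi.VolumeCondition{Abnormal: false, Message: "volume is healthy"}
	if isMountedReadOnly(req.GetVolumePath()) {
		condition = &csi.VolumeCondition{
			Abnormal: true,
			Message:  "filesystem is mounted read-only, RBD image may need to be remapped",
		}
	}
	return &csi.NodeGetVolumeStatsResponse{
		// Usage entries (capacity/inodes) are omitted in this sketch.
		VolumeCondition: condition,
	}, nil
}
```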

nixpanic added the dependency/k8s label on Jul 16, 2024
clwluvw (Member, Author) commented Jul 16, 2024

It would be possible to report that the volume is unhealthy to kubelet (in the NodeGetVolumeStats reply), which then can take actions on it (currently it only logs, so not much practical recovery yet).

I guess this would be the best option, but Kubernetes would need to be extended to restart the pod in case of a volume failure.

Madhu-1 (Collaborator) commented Jul 16, 2024

@clwluvw Yes, that is correct. For now, I think you can add a check in a hook to ensure the PVC is writable and, if not, restart the pod. This might work for RWO volumes but not for RWX, as all the pods would need to be scaled down to 0 and scaled back up.

clwluvw (Member, Author) commented Jul 16, 2024

@Madhu-1 What do you mean by the hook? Do you mean the readinessProbe?

Madhu-1 (Collaborator) commented Jul 16, 2024

I mean using a liveness probe that checks for a file in the PVC and restarts the container.
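As a sketch of that suggestion (not an existing cephcsi or Kubernetes component), a tiny probe program can try to write a marker file inside the PVC mount and exit non-zero on failure; the mount path below is a placeholder:

```go
// pvcprobe: a minimal sketch of the write check a livenessProbe could exec.
// It exits non-zero when the PVC mount is not writable (e.g. remounted
// read-only), so kubelet restarts the container. The mount path is a placeholder.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	mountPath := "/data" // placeholder: the PVC mountPath inside the container
	probeFile := filepath.Join(mountPath, ".liveness-probe")

	// Write and remove a small marker file; any error means the volume is
	// not writable and the probe should fail.
	if err := os.WriteFile(probeFile, []byte("ok"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "PVC not writable:", err)
		os.Exit(1)
	}
	_ = os.Remove(probeFile)
}
```

A livenessProbe would then exec this binary (or an equivalent write test) against the PVC mount path and let kubelet restart the container when the write fails.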

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions bot added the wontfix label on Aug 15, 2024
github-actions bot commented

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions bot closed this as not planned (stale) on Aug 22, 2024